Preface


The two volumes of Readings in Mathematical Psychology, of which this is the 
first, are designed as source materials to accompany the three-volume Handbook of 
Mathematical Psychology. The Handbook authors were asked to suggest journal 
references that they considered particularly important in their fields; from these 
suggestions the articles appearing in the Readings were selected. Because of space 
limitations and our own evaluations, we took considerable liberty in the selection
process.

This volume focuses on two main areas of psychology: psychophysics and learning.
Part I consists of 14 papers on measurement, psychophysics, and reaction time,
and Part II consists of 21 papers on learning and related mathematical and statistical
topics. These papers are referenced in Chapters 1-6 and 8-10 of the Handbook.
Volume II of the Readings contains papers relevant to other Handbook chapters.

Papers that have appeared in hard-cover publications, such as Decision Processes
(Wiley, 1954) and Studies in Mathematical Learning Theory (Stanford, 1959),
were intentionally excluded from the present Readings. It is our view that every
mathematical psychologist should have such books on his bookshelf. They are listed
after the preface to Volume I of the Handbook.

Of the 35 papers reproduced in this volume, 11 are from Psychometrika, 10 are
from Psychological Review, 3 from the Journal of Experimental Psychology, 3 from the
Journal of the Acoustical Society of America, 2 from the Pacific Journal of Mathematics,
and one each from the Bulletin of Mathematical Biophysics, the Proceedings of the
National Academy of Sciences, Transactions of the Institute of Radio Engineers, the
Journal of Symbolic Logic, the Annals of Mathematical Statistics, and a private
document of the U.S. Air Force. Gratitude is expressed for permissions to reproduce
these papers here.

The 35 papers represent the work of 30 different contributors. It may be of
interest to note that 17 of these are professional psychologists, 8 are mathematicians
or statisticians, 3 are engineers, and 2 are philosophers. One of the papers was
published in 1947, and the others are rather uniformly spread over the years 1950-1962.

The compilation of a book of this sort requires a surprising amount of
correspondence. For handling this and other details, the editors wish to thank
Miss Ada Katz.

                                          R. DUNCAN LUCE
Philadelphia, Pennsylvania                ROBERT R. BUSH
March, 1963                               EUGENE GALANTER


Contents 


PART I 


MEASUREMENT, PSYCHOPHYSICS, AND 
REACTION TIME 


An Axiomatic Formulation and Generalization of 
Successive Intervals Scaling 
by Ernest Adams and Samuel Messick 


Decision Structure and Time Relations in 
Simple Choice Behavior 
by Lee S. Christie and R. Duncan Luce 


Psychoacoustics and Detection Theory 
by David M. Green 


Some Comments and a Correction of 
“Psychoacoustics and Detection Theory” 
by David M. Green 


On the Possible Psychophysical Laws 
by R. Duncan Luce 


Multivariate Information Transmission 
by William J. McGill 


Random Fluctuations of Response Rate 
by William J. McGill 


Sensitivity to Changes in the Intensity of White Noise and 
Its Relation to Masking and Loudness 
by George A. Miller 


The Magical Number Seven, Plus or Minus Two: 
Some Limits on Our Capacity for Processing Information 
by George A. Miller 


Remarks on the Method of Paired Comparisons:
I. The Least Squares Solution Assuming Equal Standard
Deviations and Equal Correlations
by Frederick Mosteller


Theoretical Relationships among Some Measures of
Conditioning
by Conrad G. Mueller


The Theory of Signal Detectability
by W. W. Peterson, T. G. Birdsall, and W. C. Fox


Foundational Aspects of Theories of Measurement
by Dana Scott and Patrick Suppes


Models for Choice-Reaction Time
by Mervyn Stone


PART II


LEARNING AND STOCHASTIC PROCESSES


Statistical Inference about Markov Chains 
by T. W. Anderson and Leo A. Goodman 


A Stochastic Model for Individual Choice Behavior 
by R. J. Audley 


A Mathematical Model for Simple Learning 
by Robert R. Bush and Frederick Mosteller 


A Model for Stimulus Generalization and Discrimination 
by Robert R. Bush and Frederick Mosteller 


Two-Choice Behavior of Paradise Fish 
by Robert R. Bush and Thurlow R. Wilson 


Toward a Statistical Theory of Learning 
by W. K. Estes 


Statistical Theory of Spontaneous Recovery and Regression 
by W. K. Estes 


A Theory of Stimulus Variability in Learning 
by W. K. Estes and C. J. Burke


Analysis of a Verbal Conditioning Situation in Terms of 
Statistical Learning Theory 
by W. K. Estes and J. H. Straughan 


An Investigation of Some Mathematical Models for Learning
by Curt F. Fey


A Functional Equation Analysis of Two Learning Models 
by Laveen Kanal 


The Asymptotic Distribution for the Two-Absorbing-Barrier
Beta Model
by Laveen Kanal


Some Random Walks Arising in Learning Models I
by Samuel Karlin


Some Asymptotic Properties of Luce's Beta Learning Model
by John Lamperti and Patrick Suppes


Chains of Infinite Order and Their Application to
Learning Theory
by John Lamperti and Patrick Suppes


Finite Markov Processes in Psychology
by George A. Miller


On the Maximum Likelihood Estimate of the Shannon-Wiener
Measure of Information
by George A. Miller and William G. Madow


A Statistical Description of Verbal Learning
by George A. Miller and William J. McGill


Ultimate Choice between Two Attractive Goals:
Predictions from a Model
by Frederick Mosteller and Maurice Tatsuoka


A Theory of Discrimination Learning
by Frank Restle


The Role of Observing Responses in Discrimination Learning,
Part I
by L. Benjamin Wyckoff, Jr.




PART I


MEASUREMENT, PSYCHOPHYSICS, 
AND REACTION TIME 


AN AXIOMATIC FORMULATION AND GENERALIZATION 
OF SUCCESSIVE INTERVALS SCALING* 


ERNEST ADAMS 
UNIVERSITY OF CALIFORNIA, BERKELEY 
AND 
SAMUEL MESSICK 
EDUCATIONAL TESTING SERVICE 


A formal set of axioms is presented for the method of successive intervals, 
and directly testable consequences of the scaling assumptions are derived. 
Then by a systematic modification of basic axioms the scaling model is
generalized to non-normal stimulus distributions of both specified and
unspecified form.


Thurstone’s scaling models of successive intervals [7, 21] and paired 
comparisons [17, 24] have been severely criticized because of their dependence 
upon an apparently untestable assumption of normality. This objection 
was recently summarized by Stevens [22], who insisted that the procedure 
of using the variability of a psychological measure to equalize scale units 
“smacks of a kind of magic—a rope trick for climbing the hierarchy of scales. 
The rope in this case is the assumption that in the sample of individuals 
tested the trait in question has a canonical distribution, (e.g., ‘normal’) 
... There are those who believe that the psychologists who make assumptions
whose validity is beyond test are hoist with their own petard ...”
Luce [13] has also viewed these models as part of an “extensive and unsightly 
literature which has been largely ignored by outsiders, who have correctly 
condemned the ad hoc nature of the assumptions.” 

Gulliksen [11], on the other hand, has explicitly discussed the testability 
of these models and has suggested alternative procedures for handling data 
which do not satisfy the checks. Empirical tests of the scaling theory were 
also mentioned or implied in several other accounts of the methods [e.g.,
8, 9, 12, 15, 21, 29]. Criteria of goodness of fit have been presented [8, 18],
which, if met by the data, would indicate satisfactory scaling within an
acceptable error. Random errors and sampling fluctuations, as well as
systematic deviations from scaling assumptions, are thereby evaluated by these
over-all internal consistency checks. However, tests of the scaling assumptions,
and in particular the normality hypothesis, have not yet been explicitly
derived in terms of the necessary and sufficient conditions required to satisfy
the model. Recently Rozeboom and Jones [20] and Mosteller [16] have
investigated the sensitivity of successive intervals and paired comparisons,
respectively, to a normality requirement, indicating that departures from
normality in the data are not too disruptive of scale values with respect to
goodness of fit, but direct empirical consequences of the assumptions of the
model were not specified as such.

*This paper was written while the authors were attending the 1957 Social Science
Research Council Summer Institute on Applications of Mathematics in Social Science.
The research was supported in part by Stanford University under Contract NR 171-034
with Group Psychology Branch, Office of Naval Research, by Social Science Research
Council, and by Educational Testing Service. The authors wish to thank Dr. Patrick Suppes
for his interest and encouragement throughout the writing of the report and Dr. Harold
Gulliksen for his helpful and instructive comments on the manuscript.

This article appeared in Psychometrika, 1958, 23, 355-368. Reprinted with permission.

The present axiomatic characterization of a well-established scaling model
was attempted because of certain advantages which might accrue: (a) an
ease of generalization that follows from a precise knowledge of formal
properties by systematically modifying axioms, and (b) an ease in making
comparisons between the properties of different models. The next section deals
with the axioms for successive intervals and serves as the basis for the ensuing
section, in which the model is generalized to non-normal stimulus
distributions. One outcome of the following formalization which should again be
highlighted is that the assumption of normality has directly verifiable
consequences and should not be characterized as an untestable supposition.


Thurstone’s Successive Intervals Scaling Model 


The Experimental Method 


In the method of successive intervals subjects are presented with a
set of n stimuli and asked to sort them into k ordered categories with respect
to some attribute. The proportion of times $f_{s,i}$ that a given stimulus s is
placed in category i is determined from the responses. If it is assumed that a
category actually represents a certain interval of stimulus values for a subject,
then the relative frequency with which a given stimulus is placed in a
particular category should represent the probability that the subject estimates
the stimulus value to lie within the interval corresponding to the category.
This probability is in turn simply the area under the distribution curve inside 
the interval. So far scale values for the end points of the intervals are unknown, 
but if the observed probabilities for a given stimulus are taken to represent 
areas under a normal curve, then scale values may be obtained for both the 
category boundaries and the stimulus. 
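
The relation between category proportions and areas under the curve can be illustrated with a minimal Python sketch. The function name `category_probabilities` and the numerical values are hypothetical and chosen only for illustration; the sketch assumes a normal stimulus distribution and the errorless case.

```python
from statistics import NormalDist


def category_probabilities(mean, sd, boundaries):
    """Area under a normal curve inside each successive interval.

    `boundaries` holds the k-1 interior end points in increasing order;
    the first and last categories extend to -infinity and +infinity.
    """
    dist = NormalDist(mean, sd)
    # Cumulative areas at -infinity, at each boundary, and at +infinity.
    cdf = [0.0] + [dist.cdf(t) for t in boundaries] + [1.0]
    # The area inside an interval is the difference of adjacent cumulative areas.
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


# Hypothetical example: one stimulus sorted into four categories.
boundaries = [-1.0, 0.0, 1.5]
f = category_probabilities(mean=0.3, sd=1.2, boundaries=boundaries)
print([round(x, 4) for x in f])
```

The scaling problem runs in the opposite direction: the proportions are observed, and the boundaries, mean, and standard deviation are to be recovered from them.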

Scale values for interval boundaries are determined by this model, 
and interval widths are not assumed equal, as in the method of equal appearing 
intervals. Essentially equivalent procedures for obtaining successive intervals
scale values have been presented by Saffir [21], Guilford [10], Mosier [15],
Bishop [3], Attneave [2], Garner and Hake [9], Edwards [7], Burros [5], and
Rimoldi [19]. The basic rationale of the method had been previously outlined
by Thurstone in his absolute scaling of educational tests [23, 26]. Gulliksen
[12], Diederich, Messick, and Tucker [6], and Bock [4] have described least
square solutions for successive intervals, and Rozeboom and Jones [20]
presented a derivation for scale values which utilized weights to minimize
sampling errors. Most of these papers contain the notion that the assumption
of normality can be checked by considering more than one stimulus. Although
one distribution of relative frequencies can always be converted to a normal
curve, it is by no means always possible to normalize simultaneously all
of the stimulus distributions, allowing unequal means and variances, on
the same base line. The specification of exact conditions under which this is
possible will now be attempted. In all that follows, the problem of sampling
fluctuations is largely ignored, and the model is presented for the errorless
case.


The Formal Model 

The set of stimuli, denoted S, has elements r, s, u, v, ... . There is no
limit upon the admissible number of stimuli, although for the purpose of
testing the model, S must have at least two members. For each stimulus
s in S, and each category $i = 1, 2, \ldots, k$, the relative frequency $f_{s,i}$ with
which stimulus s is placed in category i is given. Formally f is a function
from the Cartesian product $S \times \{1, 2, \ldots, k\}$ to the real numbers. More
specifically, it will be the case that for each s in S, $f_s$ will be a probability
distribution over the set $\{1, 2, \ldots, k\}$. For the sake of an explicit statement
of the assumptions of the model, this fact will appear as an axiom, although
it must be satisfied by virtue of the method of determining the values of $f_{s,i}$.

Axiom 1. f is a function mapping $S \times \{1, \ldots, k\}$ into the real numbers
such that for each s in S, $f_s$ is a probability distribution over $\{1, \ldots, k\}$; i.e.,
for each s in S and $i = 1, \ldots, k$, $0 \le f_{s,i} \le 1$ and $\sum_{i=1}^{k} f_{s,i} = 1$.

The set S and the function f constitute the observables of the model. 
Two more concepts which are not directly observed remain to be introduced.
The first of these is a set of numbers $t_1, \ldots, t_{k-1}$, which are the end points
of the intervals corresponding to the categories. It is assumed that these
intervals are adjacent and that they cover the entire real line. Formally,
it will simply be assumed that $t_1, \ldots, t_{k-1}$ are an increasing series of real
numbers.

Axiom 2. Interval boundaries $t_1, \ldots, t_{k-1}$ are real numbers, and for
$i = 2, \ldots, (k - 1)$, $t_{i-1} \le t_i$.

Finally, the distribution corresponding to each stimulus s in S is
represented by a normal distribution function $N_s$.

Axiom 3. N is a function mapping S into normal distribution functions
over the real line.

Axioms 1-3 do not state fully the mathematical properties required for
the set S, the numbers $t_1, \ldots, t_{k-1}$, and the functions $N_s$. In the interests
of completeness, these will be stated in the following Axiom 0, which for
formal purposes should be referred to instead of Axioms 1-3.

Axiom 0. S is a non-empty set. k is a positive integer. f is a function
mapping $S \times \{1, \ldots, k\}$ into the closed interval [0, 1], such that for each s
in S, $\sum_{i=1}^{k} f_{s,i} = 1$. For $i = 1, \ldots, (k - 1)$, $t_i$ is a real number, and for
$i = 1, \ldots, (k - 2)$, $t_i < t_{i+1}$. N is a function mapping S into the set of
normal distribution functions over the real numbers.

Axioms 2 and 3 state only the set-theoretical character of the elements
$t_i$ and $N_s$, and have no intuitive empirical content. The central hypothesis
of the theory states the connection between the observed relative frequencies
$f_{s,i}$ and the assumed underlying distributions $N_s$.


Axiom 4. (Fundamental hypothesis) For each s in S and $i = 1, \ldots, k$,

$$f_{s,i} = \int_{t_{i-1}}^{t_i} N_s(\alpha)\, d\alpha.$$

(Note that if $i = 1$, $t_{i-1}$ is set equal to $-\infty$, and if $i = k$, $t_i = \infty$.)

Axioms 1-4 state the formal assumptions of the theory although, because
the fundamental hypothesis (Axiom 4) involves the unobservables $N_s$ and
$t_i$, it is not directly testable in these terms. The question of testing the model
will be discussed in the next section. Scale values for the stimuli have not
yet been introduced. These are defined to be equal to the means of the
distributions $N_s$, and hence are easily derived. The function v will represent the
scale values of the stimuli.

DEFINITION 1. v is the function mapping S into the real numbers such
that for each s in S, $v_s$ is the mean of $N_s$; i.e.,

$$v_s = \int_{-\infty}^{\infty} \alpha N_s(\alpha)\, d\alpha.$$


Testing the Model


The model will be said to fit exactly if all of the testable consequences 
of Axioms 1-4 are verified. Testable consequences of these axioms will be 
those consequences which are formulated solely in terms of the observable 
concepts S and f, or of concepts which are definable in terms of S and f. 
If no further assumptions are made about an independent determination of
$t_1, \ldots, t_{k-1}$ and N, then the testable consequences are just those which
follow about f and S from the assumption that there exist numbers
$t_1, \ldots, t_{k-1}$ and functions $N_s$ which satisfy Axioms 1-4. In this model,
it is possible to give an exhaustive description of the testable consequences;
hence this theory is axiomatizable in the sense that it is possible to formulate
observable conditions which are necessary and sufficient to insure the existence
of the numbers $t_i$ and functions $N_s$. The derivation of these conditions will
proceed by stages.

Let $p_{s,i}$ be the cumulative distribution of the function f for stimulus s
and interval i.

DEFINITION 2. For each s in S and $i = 1, \ldots, k$,

$$p_{s,i} = \sum_{j=1}^{i} f_{s,j}.$$

It follows from this definition and Axiom 4 that for each s in S and
$i = 1, \ldots, k$,

$$p_{s,i} = \int_{-\infty}^{t_i} N_s(\alpha)\, d\alpha. \tag{1}$$

Using the table for the cumulative distribution of the normal curve with
zero mean and unit variance, the numbers $z_{s,i}$ may be determined such that

$$p_{s,i} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_{s,i}} e^{-x^2/2}\, dx. \tag{2}$$

(Note that for $i = k$, $z_{s,i}$ will be infinite.) $N_s$ is a normal distribution function
and must have the form:

$$N_s(\alpha) = \frac{1}{\sigma_s \sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma_s^2}(\alpha - v_s)^2\right], \tag{3}$$

where $\sigma_s^2$ is the variance of $N_s$ about its mean $v_s$. Equations (1), (2), and (3)
yield the conclusion that for each s in S and $i = 1, \ldots, k$,

$$z_{s,i} = (t_i - v_s)/\sigma_s. \tag{4}$$

In (4) the numbers $z_{s,i}$ on the left are known transformations of the
observed proportions $f_{s,i}$, while the numbers $t_i$, $v_s$, and $\sigma_s$ are unknown.
Suppose however that r is a fixed member of the class S of stimuli; it is
possible to solve (4) for all the unknowns in terms of the known z's, and
$v_r$ and $\sigma_r$, the mean and standard deviation of the fixed stimulus r. These
solutions are

$$t_i = \sigma_r z_{r,i} + v_r, \qquad i = 1, \ldots, (k - 1); \tag{5}$$

for $s \in S$, and $i \ne j$,

$$\sigma_s = \sigma_r\, \frac{z_{r,i} - z_{r,j}}{z_{s,i} - z_{s,j}}, \tag{6}$$

$$v_s = \sigma_r \left[ z_{r,i} - z_{s,i}\, \frac{z_{r,i} - z_{r,j}}{z_{s,i} - z_{s,j}} \right] + v_r. \tag{7}$$

The necessary and sufficient condition that the system of equations (4)
have a solution, and hence that $t_i$, $v_s$, and $\sigma_s$ be determinable using (5), (6),
and (7), is that all $z_{s,i}$ be linear functions of each other in the following sense.
For all r and s in S, there exist real numbers $a_{r,s}$ and $b_{r,s}$ such that for each
$i = 1, \ldots, (k - 1)$,

$$z_{s,i} = a_{r,s} z_{r,i} + b_{r,s}. \tag{8}$$

The required numbers $a_{r,s}$ and $b_{r,s}$ exist if and only if for each r and s, the ratio

$$\frac{z_{r,i} - z_{r,j}}{z_{s,i} - z_{s,j}} = \frac{1}{a_{r,s}} \tag{9}$$

is independent of i and j.

If constants $a_{r,s}$ and $b_{r,s}$ satisfying (8) exist, then they are related to
the scale values $v_s$ and the standard deviations $\sigma_s$ in a simple way. For each
r, s in S,

$$a_{r,s} = \sigma_r/\sigma_s \tag{10}$$

and

$$b_{r,s} = (v_r - v_s)/\sigma_s. \tag{11}$$

Clearly the arbitrary choice of the constants $v_r$ and $\sigma_r$ in (5), (6), and (7)
represents the arbitrary choice of origin and unit in the scale. Since scale
values of $t_i$ and $v_s$ are uniquely determined once $v_r$ and $\sigma_r$ are chosen, the
scale values are unique up to a linear transformation; i.e., an interval scale
of measurement has been determined. It should be noted that this model
does not require equality of standard deviations (or what Thurstone has
called discriminal dispersions [25]) but provides for their determination
from the data by equation (6). This adds powerful flexibility in its possible
applications.

It remains only to make a remark about the necessary and sufficient
condition which a set of observed relative frequencies $f_{s,i}$ must fulfill in
order to satisfy the model. This necessary and sufficient condition is simply
that the numbers $z_{s,i}$, which are defined in terms of the observed relative
frequencies, be linearly related as expressed in (8). This can be determined
by seeing if the ratios computed from (9) are independent of i and j, or by
evaluating for all s, r the linearity of the plots of $z_{s,i}$ against $z_{r,i}$. Hence
for this model there is a simple decision procedure for determining whether
or not a given set of errorless data fits.
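
The decision procedure for errorless data can be sketched in Python as follows. This is an illustrative reading of equations (5), (6), (7), and (9), not a published algorithm: the function name `fit_errorless`, the dictionary layout, and the tolerance are assumptions, and the sketch presupposes at least three categories and strictly increasing z-values for every stimulus.

```python
def fit_errorless(z, r, v_r=0.0, sigma_r=1.0, tol=1e-9):
    """Recover boundaries and (mean, sd) per stimulus from errorless z-values.

    `z` maps each stimulus to its list of k-1 finite normal deviates.
    `r` is the reference stimulus whose mean and sd fix origin and unit.
    Raises ValueError if the ratio in (9) is not constant for some stimulus.
    """
    z_r = z[r]
    t = [sigma_r * z_ri + v_r for z_ri in z_r]            # equation (5)
    scale = {}
    for s, z_s in z.items():
        # Ratio (9) for adjacent boundaries; if these all agree, the ratio
        # agrees for every pair i, j, because the differences add up.
        ratios = [(z_r[i] - z_r[i - 1]) / (z_s[i] - z_s[i - 1])
                  for i in range(1, len(z_s))]
        if max(ratios) - min(ratios) > tol:
            raise ValueError(f"z-values of {s!r} are not linear in those of {r!r}")
        sigma_s = sigma_r * ratios[0]                     # equation (6)
        v_s = t[0] - sigma_s * z_s[0]                     # equation (4) solved for the mean
        scale[s] = (v_s, sigma_s)
    return t, scale


# Hypothetical errorless z-values: stimulus "a" has mean 0 and sd 1,
# stimulus "b" has mean 0.5 and sd 2, with boundaries -1, 0, 1.5.
z = {"a": [-1.0, 0.0, 1.5], "b": [-0.75, -0.25, 0.5]}
t, scale = fit_errorless(z, r="a")
print(t, scale)
```

Choosing different values for `v_r` and `sigma_r` changes the recovered numbers only by a linear transformation, which is the interval-scale property noted above.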

If $z_{s,i}$ and $z_{r,i}$ are found to be linearly related for all s, r in S, the
assumptions of the scaling model are verified for that data. If the z's are not linearly
related, then assumptions have been violated. For example, the normal curve
may not be an appropriate distribution function for the stimuli and some other
function might yield a better fit [cf. 11, 12]; or perhaps the responses cannot
be summarized unidimensionally in terms of projections on the real line
representing the attribute [11]. If the stimuli are actually distributed in a
multidimensional space, then judgments of projections on one of the attributes
may be differentially distorted by the presence of variations in other
dimensions. This does not mean that stimuli varying in several dimensions may
not be scaled satisfactorily by the method of successive intervals, but rather
that if the model does not fit, such distortion effects might be operating.
A multidimensional scaling model [14] might prove more appropriate in
such cases.

In practice the set of points $(z_{r,i}, z_{s,i})$ for $i = 1, \ldots, (k - 1)$ will
never exactly fit the straight line of (8) but will fluctuate about it. It remains
to be decided whether this fluctuation represents systematic departure 
from the model or error variance. In the absence of a statistical test for 
linearity, the decision is not precise, although the linearity of the plots may 
still be evaluated, even if only by eye. One approach is to fit the obtained 
points to a straight line by the method of least squares and then evaluate 
the size of the obtained minimum error [4, 6, 12]. In any event, the test of 
the model is exact in the errorless case, and the incorporation of a suitable 
sampling theory would provide decision criteria for direct experimental 
applications. 
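
A minimal Python sketch of the least-squares approach mentioned above is given here. It fits the line of (8) to one pair of stimuli and reports the root-mean-square residual; the function name `least_squares_line` and the sample values are hypothetical, and no statistical test of the residual is implied.

```python
def least_squares_line(x, y):
    """Fit y = a*x + b by ordinary least squares; return (a, b, rms residual)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx                      # slope, the estimate of a_{r,s} in (8)
    b = mean_y - a * mean_x            # intercept, the estimate of b_{r,s} in (8)
    rms = (sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)) / n) ** 0.5
    return a, b, rms


# Hypothetical z-values for a reference stimulus r and another stimulus s.
z_r = [-1.0, 0.0, 1.5]
z_s = [-0.75, -0.25, 0.5]              # exactly 0.5 * z_r - 0.25
a, b, rms = least_squares_line(z_r, z_s)
print(a, b, rms)
```

With sampled data the residual will not vanish, and how large a residual is tolerable remains the judgment call described in the text.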

A Generalization of the Successive Intervals Model 

The successive intervals model discussed in the previous section can
be generalized in a number of ways. One generalization, treated in detail
by Torgerson [27], considers each interval boundary $t_i$ to be the mean of a
subjective distribution with positive variance. Another approach toward
generalizing the model is to weaken the requirement of normal distributions
of stimulus scale values. Formally, this generalization amounts to enlarging
the class of admissible distribution functions. Instead of specifying exactly
which distribution functions are allowed in the generalization, assume an
arbitrary set $\Psi$ of distributions over the real line, to which it is required that
the stimulus distributions belong. In formalizing the model, $\Psi$ is characterized
simply as a set of distribution functions over the real line. Axiom 3 may be
replaced by a new axiom specifying the nature of the class $\Psi$ and stating
that C is a function mapping S into elements of $\Psi$; i.e., for each s in S, $C_s$
(interpreted as the distribution of the stimulus s) is a member of $\Psi$.

One final assumption about the class $\Psi$ needs to be added: namely,
if $\Psi$ contains a distribution function C, then it must contain all linear
transformations of C. A linear transformation of a distribution function C is
defined as any other distribution function $C'$ which can be obtained from
C by a shift of origin and a scale transformation of the horizontal axis. A
stretch along the horizontal axis must be compensated for by a contraction
on the vertical axis in order that the transformed function also be a probability
density function. Algebraically, these transformations have the following
form. Let D and $D'$ be distribution functions; then $D'$ is a linear transformation
of D if there exists a positive real number a and a real number b such that
for all x,

$$D'(x) = aD(ax + b).$$


This is not truly a linear transformation because of multiplication by a on the 
ordinate, but for lack of a better term this phrase is used. The reason for 
requiring that the class $\Psi$ of distribution functions be closed under linear
transformations is to insure that in any determination of stimulus scale
values it will be possible to convert them by a linear transformation into
another admissible set of scale values; i.e., the stimulus values obtained are
to form an interval scale. If the set $\Psi$ is not closed under linear transformations,
in general it will not be possible to alter the scale by an arbitrary linear
transformation.

Axiom 3'. $\Psi$ is a set of distribution functions over the real numbers, and
C is a function mapping S into $\Psi$. For all D in $\Psi$, if a is a positive real number
and b is a real number, then the function $D'$ such that for all x,

$$D'(x) = aD(ax + b)$$

is a member of $\Psi$.

It is to be observed that the set of normal distributions has the required 
property of being closed under linear transformations. This set is in fact a 
minimal class of this type, in the sense that all normal distribution functions 
can be generated from a single normal distribution function by linear trans- 
formations. 
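
The closure property can be checked numerically with a short Python sketch. It assumes the unit normal density as the generating function and uses the transformation $D'(x) = aD(ax + b)$ with $a = 1/\sigma$ and $b = -\mu/\sigma$; the function names and the particular mean and standard deviation are illustrative assumptions.

```python
from math import exp, pi, sqrt


def unit_normal_density(x):
    """Density of the normal curve with zero mean and unit variance."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def linear_transform(density, a, b):
    """Return the density x -> a * density(a*x + b), with a > 0."""
    if a <= 0:
        raise ValueError("a must be positive")
    return lambda x: a * density(a * x + b)


def normal_density(x, mean, sd):
    """Density of a normal curve with the given mean and standard deviation."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))


# Hypothetical target: a normal curve with mean 0.3 and sd 1.2.
mean, sd = 0.3, 1.2
transformed = linear_transform(unit_normal_density, a=1.0 / sd, b=-mean / sd)

for x in (-2.0, 0.0, 0.3, 1.7):
    print(x, transformed(x), normal_density(x, mean, sd))
```

The two printed columns should coincide, illustrating that every normal density arises from the unit normal density by one such transformation.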

Finally, Axiom 4 is replaced by an obvious generalization which specifies
the connection between the observed $f_{s,i}$, the distribution functions $C_s$,
and the interval end points $t_i$.

Axiom 4'. For each s in S and $i = 1, \ldots, k$,

$$f_{s,i} = \int_{t_{i-1}}^{t_i} C_s(x)\, dx.$$

(Here again $t_0 = -\infty$ and $t_k = \infty$.) The stimulus values are defined as
before to be the means of the distribution functions $C_s$.

DEFINITION 1'. v is the function mapping S into the real numbers such
that for each s in S, $v_s$ is the mean of $C_s$; i.e.,

$$v_s = \int_{-\infty}^{\infty} x\, C_s(x)\, dx.$$


The problem now is to specify the class of admissible distribution
functions $\Psi$. Each specification of this class amounts to a theory about the
underlying stimulus distributions. If the hypothesis of normality is altered or
weakened, what assumptions can replace it? Omitting any assumption about
the form of the distribution functions would amount to letting $\Psi$ be the set
of all distribution functions over real numbers. If no assumption whatever
is made about the forms of $C_s$, then the theory is very weak. Every set of
data will fit the theory, and the scale values of $t_i$ can be determined only
on an ordinal scale. It is always possible to determine distribution functions
$C_s$ satisfying Axiom 4' for arbitrarily specified $t_i$. To show this it is only
necessary to construct them in accordance with the following definition.

$$C_s(x) = \begin{cases} \dfrac{f_{s,i}}{t_i - t_{i-1}}, & t_{i-1} < x \le t_i,\ i = 1, \ldots, k; \\[1ex] 0, & \text{otherwise.} \end{cases}$$


Non-normal Distributions of Specified Form 

It is clearly necessary to make some restrictions on $\Psi$ if the scale values
are to be determined uniquely up to a linear transformation. It will next
be shown that any minimal class of distribution functions, in the sense of
a class all of whose members are generated from a single member by linear
transformations, has the desired property of generating a linear scale of
stimulus values when the model fits. For the present assume that $\Psi$ is a
minimal class of distribution functions.

ASSUMPTION 1. There exists a distribution function D such that for all
distribution functions $D'$ in $\Psi$ there exists a positive real number a and a
real number b such that for all x,

$$D'(x) = aD(ax + b).$$

To show that if Assumption 1 is satisfied the scale values are obtained
on an interval scale, we proceed as follows. Axiom 3' and Assumption 1 imply
that for all s in S, there exists a positive real number $a_s$ and a real number
$b_s$ such that for all x,

$$C_s(x) = a_s D(a_s x + b_s), \tag{12}$$

where the function D on the right side of (12) is a fixed function of some
specified form linearly related to all the functions $D'$ in $\Psi$. According to
Axiom 4', then, for each s in S, and $i = 1, \ldots, k$,

$$f_{s,i} = \int_{t_{i-1}}^{t_i} a_s D(a_s x + b_s)\, dx. \tag{13}$$


If $\pi$ is the cumulative distribution corresponding to D, and the cumulative
distributions $p_{s,i}$ are defined as before, then

$$p_{s,i} = \int_{-\infty}^{t_i} a_s D(a_s x + b_s)\, dx = \pi(a_s t_i + b_s). \tag{14}$$

Assuming that the function $\pi$ is strictly monotone increasing, then, knowing
the form of function D, it is possible to determine uniquely the numbers
$z_{s,i}$ such that for each s in S and $i = 1, \ldots, k$,

$$p_{s,i} = \pi(z_{s,i}). \tag{15}$$

Equations (14) and (15) imply immediately that

$$z_{s,i} = a_s t_i + b_s \tag{16}$$

for all s in S and $i = 1, \ldots, k$. It is clear from (15) why it is necessary
to assume that $\pi$ is strictly monotone increasing. If it were not, there would
not in general be a unique $z_{s,i}$ determined by (15); hence the scale values
based on $z_{s,i}$ would not be unique. It is also seen that (4), relating $z_{s,i}$ to
$t_i$, $v_s$, and $\sigma_s$ in the normal distribution model, is simply a particular case of
(16) here. The connection between $a_s$, $b_s$ and $\sigma_s$ and $v_s$ is

$$\sigma_s = 1/a_s, \qquad v_s = -b_s/a_s.$$


In (16), as in the corresponding set of equations obtained from the
normality assumption, the numbers on the left are known, and the numbers
on the right are unknown. As before, if two numbers $a_r$ and $b_r$ are arbitrarily
determined for a fixed stimulus r, then the $t_i$ are uniquely determined by the
following equation.

$$t_i = (z_{r,i} - b_r)/a_r, \qquad i = 1, \ldots, (k - 1). \tag{17}$$

The scale values for the stimuli, however, cannot be directly determined
from the coefficients $z_{s,i}$, $a_r$, and $b_r$ without first specifying the mean m
of the basic distribution D. If m is the mean of D, then $v_s$, which was defined
as the mean of $C_s$, is determined by

$$v_s = (m - b_s)/a_s. \tag{18}$$

Both the $a_s$ and the $b_s$ in (18) can be determined in terms of $z_{s,i}$, $a_r$, and $b_r$
by (19) and (20); hence $v_s$ is immediately determinable in terms of just these
quantities by (18).

$$a_s = a_r\, \frac{z_{s,i} - z_{s,j}}{z_{r,i} - z_{r,j}}, \tag{19}$$

$$b_s = z_{s,i} - \frac{z_{s,i} - z_{s,j}}{z_{r,i} - z_{r,j}}\,(z_{r,i} - b_r). \tag{20}$$

It is clear then that the scale values $t_i$ and $v_s$ are determined up to a
linear transformation. Furthermore, necessary and sufficient conditions
that a set of data fit the model are simply that the ratios of differences on
the right in (19) be independent of i and j; i.e., that the z's be linearly
related.
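
To make equations (15) through (20) concrete, the following Python sketch takes the logistic distribution as the generating function. That choice is an assumption made purely for illustration (the text leaves the specified form open); the logistic density has mean $m = 0$ and its cumulative function has a closed-form inverse. The function names are hypothetical, and the sketch assumes the data already fit the model, so it does not test the linearity condition.

```python
from math import exp, log


def logistic_cdf(x):
    """Cumulative logistic distribution, used here as the generating function."""
    return 1.0 / (1.0 + exp(-x))


def logit(p):
    """Inverse of logistic_cdf, valid for 0 < p < 1."""
    return log(p / (1.0 - p))


def scale_from_cumulative(p, r, a_r=1.0, b_r=0.0, m=0.0):
    """Boundaries and stimulus means from cumulative proportions.

    `p` maps each stimulus to its k-1 cumulative proportions (k >= 3).
    `r` is the reference stimulus; `a_r`, `b_r` fix unit and origin.
    `m` is the mean of the generating density (0 for the logistic).
    """
    z = {s: [logit(x) for x in ps] for s, ps in p.items()}    # invert (15)
    z_r = z[r]
    t = [(z_ri - b_r) / a_r for z_ri in z_r]                  # equation (17)
    v = {}
    for s, z_s in z.items():
        a_s = a_r * (z_s[1] - z_s[0]) / (z_r[1] - z_r[0])     # equation (19)
        b_s = z_s[0] - a_s * t[0]                             # equation (20)
        v[s] = (m - b_s) / a_s                                # equation (18)
    return t, v


# Hypothetical errorless data: boundaries -1, 0, 1.5;
# stimulus "a" has a = 1, b = 0; stimulus "b" has a = 0.5, b = -0.25.
true_t = [-1.0, 0.0, 1.5]
params = {"a": (1.0, 0.0), "b": (0.5, -0.25)}
p = {s: [logistic_cdf(a * t_i + b) for t_i in true_t] for s, (a, b) in params.items()}

t, v = scale_from_cumulative(p, r="a")
print(t, v)
```

With the reference constants set to the values used in generating the data, the boundaries are recovered exactly up to rounding, and the mean of stimulus "b" comes out as $(0 + 0.25)/0.5 = 0.5$.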
The Forms of the Distributions Unspecified


A final generalization to be considered is one in which Assumption 1 
holds, but where the form of the generating function D is not specified; 
j.e., it is assumed that the underlying distributions all belong to one minimal 
class, but that the class can be generated by any distribution function D. 
Interestingly enough, in this case it is still possible to test the model and 
to obtain more than ordinal information about the scale values. If it is 
assumed that the stimulus distributions all belong to one minimal family 
generated by a function D, but D is unknown, all of the deductions up through 
(14) go through, although in this case the function r is also unknown. Now, 
of course, it is impossible to discover the numbers 2,.; by solving (15), but 
if it is postulated that the function is strictly monotone increasing, it is 
still possible to obtain some information about the numbers (at; + b,). 
Since # is a cumulative distribution it is monotone increasing; however, it 
will only be strictly monotone increasing in case the distribution function 
D is never zero. This assumption is made explicit in Assumption 2. 


ASSUMPTION 2. For all x, D(x) > 0. 

Now, if $\pi$ is strictly monotone increasing, then it follows that $\pi(x) > \pi(y)$ 
if and only if x > y. If (14) holds, then it will be the case that for all 
stimuli r, s and all boundaries i, j, 


(21) $p_{r,i} \ge p_{s,j}$ if and only if $a_r t_i + b_r \ge a_s t_j + b_s$. 


Therefore from an ordering on the numbers $p_{s,i}$ one can obtain a system of 
inequalities involving the constants $a_s$, $b_s$, and $t_i$. If it is further specified 
(as is required for the conditions of the problem) that $a_s > 0$ for all s, then 
this set of inequalities will not in general have a solution. 
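
As a rough illustration of what the system (21) demands, the following Python sketch checks whether a candidate set of constants reproduces the ordering of the observed proportions. The function name, the table layout p[s][i], and the logistic curve used to generate the example data are all assumptions made for the illustration; finding a solution, as opposed to checking one, is a harder problem that the sketch does not attempt.

```python
import math
from itertools import product


def satisfies_order(p, a, b, t):
    """True if p[r][i] >= p[s][j] holds exactly when
    a[r]*t[i] + b[r] >= a[s]*t[j] + b[s], with every a[s] > 0."""
    if any(a_s <= 0 for a_s in a):
        return False
    cells = [(s, i) for s in range(len(a)) for i in range(len(t))]
    for (r, i), (s, j) in product(cells, repeat=2):
        observed = p[r][i] >= p[s][j]
        implied = a[r] * t[i] + b[r] >= a[s] * t[j] + b[s]
        if observed != implied:
            return False
    return True


def logistic(x):
    # Any strictly increasing cumulative curve would serve here.
    return 1.0 / (1.0 + math.exp(-x))


a = [1.0, 2.0]
b = [0.0, -0.5]
t = [-1.0, 0.0, 1.0]
p = [[logistic(a_s * t_i + b_s) for t_i in t] for a_s, b_s in zip(a, b)]

fits = satisfies_order(p, a, b, t)            # generating constants
fails = satisfies_order(p, a, b, t[::-1])     # boundaries in the wrong order
```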

However, whether or not a set of data fits the model may still be deter- 
mined. The necessary and sufficient condition for fit is that there exist numbers 
$a_s$, $t_i$ and $b_s$ (where $a_s > 0$) satisfying the system of inequalities (21). If 
this set of inequalities has a solution, then the interval boundaries may be 
taken to be the $t_i$ satisfying (21). To determine the scale values of the stimuli 
it is first necessary to construct a distribution function which can represent 
the data. This is done in the following way. A differentiable monotone in- 
creasing function $\pi(x)$ is constructed by connecting the discrete set of points 


$\pi(a_s t_i + b_s) = p_{s,i}$ 

with any smooth, strictly monotone increasing curve. If, as is usual, there is 
only a finite number of stimuli, then such a curve can always be constructed. 
Finally, the distribution function D is defined by 


(22) $D(x) = \frac{d}{dx}\,\pi(x)$. 
Then, if the mean of the distribution D is m, the values $v_s$ of the stimuli 
are determined by (18), $v_s = (m - b_s)/a_s$. As far as the determination of 
the $v_s$ is concerned, it can be seen that they depend solely on the previously 
determined $a_s$ and $b_s$ and on the mean m, which can be regarded as an additional 
arbitrary constant in the determination of the $v_s$. 

The remaining point of discussion for this model is the determination 
of the degree of uniqueness of the scale values. Finding the set of all possible 
solutions to the inequalities (21) presents, in general, extreme difficulty. 
One thing that can be simply determined is the class of what might be called 
the universal transformations of the solutions of the system of inequalities. 
A universal transformation is one which, applied to a solution of any set of 
inequalities, yields another solution to the same set of inequalities. By noting 
a close connection between the theory of the inequalities (21) and a two- 
dimensional affine geometry with a distinguished set of horizontal and vertical 
lines, it can be shown [1] that the class of universal transformations for this 
model is a subset of the affine transformations. The universal transformations 
of the interval boundaries $t_i$ are the linear ones, and of the $a_s$ are multiplica- 
tions by a positive constant. The $b_s$ also are determined up to a linear trans- 
formation, and hence so are the scale values $v_s$ (although the additional 
arbitrary constant m also enters into their determination). 

There is also an interesting special case in which, even though there is 
only a finite number of observations, the scale values of the $t_i$ are determined 
up to a linear transformation. This might be called the special case of equal 
intervals, in which differences in successive $t_i$ are all the same. If, for example, 
there exist stimuli with such relations among corresponding p’s aS pz; = 
DPu,ts1 = Dita, Dain = Diisn 1 DP,i = Dp: , etc., it is possible to deter- 
mine that successive intervals are equal [1]. | 

The fact that scale values obtained in this model, at least under certain 
circumstances, are unique up to a linear transformation has two interesting 
consequences for the original successive intervals model based on the nor- 
mality hypothesis. (i) If in the errorless case the original model fits, then 
no other successive intervals model which assumes a different form for the 
distribution functions will fit. The reason for this is that the forms of the 
distribution functions (or the cumulative distributions) are determined by the 
values of $p_{s,i}$ lying above the point $t_i$. Hence, if the $t_i$ are determined up to 
linear transformation, so are the curves $p_s$. (ii) Where the normality assumption 
does not fit the data it is theoretically possible to use the present generalization 
to obtain a scale. Then the deviation of the scale values from those obtained 
under a normality requirement can be evaluated. This, at least in principle, 
provides a second kind of goodness of fit besides the usual least squares 
regression methods employed where the data do not exactly fit the Thurstone 
model. 


DECISION STRUCTURE AND TIME RELATIONS IN 

The structure of simple decisions is considered in terms of a model 
which composes such decisions from hypothetical elementary decisions. 
It is argued that reaction-time data can be treated by the use of the 
Laplace transform so as to overcome difficulties which negated earlier 
attempts to analyze choice reactions. The general model leads to com- 
plex problems which are formulated but not solved. Two special cases 
of the model are worked out, and the statistical problem of evaluating 
the fit of the model is discussed. It is shown that treating decision 
processing as time-discrete leaves the essential features of the analysis 
unchanged. Two experimental proposals, to provide data which should 
be considered in further work on the model, are made. 


I. Introduction. In this paper we propose a model for the way hu- 
man beings organize the decisions required by simple choice situa- 
tions into a collection of component decisions. It is our thesis 
that such an organization of decisions must be reflected in the 
distribution of reaction times and, therefore, that it may be pos- 
sible to infer the organization from the reaction-time distribution. 
Although our thinking derives from empirical studies, we must 
describe this proposal as speculative, for the model is not firmly 
based on such studies. However, the development of the model 
has led us to suggest two experiments which we believe may help 
to determine what merit it has. These experiments will also help 
to decide whether it is desirable to pursue further work in an at- 
tempt to modify the model to accord better with reality, for we have 
little hope that the particular details of the present model have any 
lasting value. 


II. Reaction Times. Suppose that a subject receives a stimulus of 
a fixed type at time 0 and responds at time t with a fixed type of 
response. The time interval, t, between the stimulus and the re- 
sponse is called the simple reaction time. If the subject is pre- 
sented with one of a set of stimuli and a choice of response con- 
tingent on the stimulus is required, the corresponding time interval 
is known as the disjunctive reaction time. In either case, it is 
clear that to obtain stable and readily analyzable time distribu- 
tions it is necessary that the stimulus be simple enough so that 
the mean reaction time is no more than a second or two. Otherwise 
unwanted stimuli may intervene between the test stimulus and the 
response, and the interaction among the stimuli will cause a dis- 
tortion of the time distribution which will be very difficult to 
analyze. 

The study of reaction times, including disjunctive reaction times, 
has a long history in the literature of psychology (cf. Woodworth, 
1938, chap. xiv). In recent years, however, relatively little in- 
terest has been evident in reaction-time studies. We may attribute 
this loss of interest to two related causes. First, there has been 
a failure to separate the time to make a decision ("decision latency"*) 
from the other time lags involved in the total process. One attempt 
to make this separation involved measuring the subject’s response 
to a stimulus when no decision was to be made and subtracting 
this time from the time required to respond to the same stimulus 
with the same motor action when a decision was involved. This 
technique has been considered unsatisfactory for the following 
reason. If the subject has no decision to make he is able to bring 
his motor readiness for the specified response to a much higher 
pitch than he can when he is required to make a disjunctive re- 
action; thus, the base time (the time to react in a choice situation 
excluding the time for the decision itself) cannot be equated to 
any simple reaction time. We may conclude that the base time will 
be determined, if at all, only from measurements taken when the 
subject is required to make a decision. 

*We use reaction time when referring to the time of a process timed 
from stimulus presentation to motor response; latency when referring to 
times of distinguished parts of such a process. 

Second, suppose that in one way or another the pure decision 
latency distribution has been obtained—then what? It is true that 
if these distributions were found to be extremely simple, in that 
they could be well approximated by some class of elementary 
mathematical functions, the separation of non-choice latencies 
(base times) from decision latencies might be an end in itself. If, 
however, the resulting decision latency distribution were of a 
complex character, the challenge to account for it in more primitive 
terms would remain. 

We describe these as related difficulties, for it is not unreason- 
able to suppose that the method used to tease out the non-choice 
latencies (base times) can also be used, or adapted, to decompose 
the decision latencies into more primitive terms. Such a decom- 
position of the observed reaction-time distribution may be an en- 
tirely formal mathematical process with no empirical correlate or it 
may be based on a model which purports to describe the way a 
human being composes the finally observed decision from certain 
more elementary ones. It is with such a model that we are concerned. 

At the heart of our proposal is the idea that the mathematical 
technique of the Laplace transform may be employed usefully in 
the study of reaction times. Since it is unlikely that every one of 
our readers will be familiar with the Laplace transform, we have 
devoted the next section to its definition and to a list of those of 
its elementary properties which we shall need. 


III. The Laplace Transform. Let F be a real-valued function of a 
real variable t such that F(t) = 0 for t < 0. The real-valued func- 
tion L(F) of the real variable s defined by the equation 


$$L(F) = \int_0^\infty e^{-st} F(t)\,dt \tag{1}$$


is called the Laplace transform of F. There is essentially no loss 
of information about F in making this transformation [see equation 
(4)], but because of some of the special properties of the transform 
there is sometimes a distinct advantage to working with trans- 
formed functions. We shall list a few of the elementary properties 
of the transform which we shall need later; no proofs will be given 
for they are well known (cf. Churchill, 1944). 


i. $$L\!\left(\int_0^t F_1(t-\tau)\,F_2(\tau)\,d\tau\right) = L(F_1)\,L(F_2). \tag{2}$$

ii. $$L\!\left(\frac{dF}{dt}\right) = sL(F) - F(0). \tag{3}$$

iii. If L(F) = L(G), then F = G + N, where N is some 
function with the property $\int_0^T N(t)\,dt = 0$ for all T > 0. If it is known 
that F and G are continuous, then N is continuous and so N = 0 and 
F = G. (4) 

iv. If a and b are constants, 

$$L(aF + bG) = aL(F) + bL(G). \tag{5}$$

v. If $F(t) = Ae^{-\lambda t}$, where A is a constant, then 

$$L(F) = \frac{A}{s + \lambda}. \tag{6}$$

IV. The Model. Our proposal is based on assumptions which are 
intuitively acceptable, but which at the moment do not appear to be 
susceptible of direct verification. It is our impression that any 
empirical verification of the model must deal with the full set of 
assumptions rather than with each in isolation. 

Assumption I. It is possible, for a given experimental situation, 
to divide the observed reaction time t into two latency components 
$t_b$ and $t_c$, called base time and choice time respectively, such that: 

1. $t = t_b + t_c$. 

2. The value of $t_b$ depends only on the mode of stimulus pres- 
entation and on the motor actions required of the subject. Specif- 
ically, it is not directly dependent on the character of the choice 
demanded. 

3. The value of $t_c$ depends only on the choice demanded. Spe- 
cifically, it is not directly dependent on the mode of stimulus pres- 
entation or the motor actions required. 

Let the distributions of t, $t_b$ and $t_c$ be denoted by f, $f_b$ and $f_c$ 
respectively. Since conditions 2 and 3 imply that the two com- 
ponent latencies are independent for a fixed experimental situa- 
tion, it follows from condition 1 that 


$$f(t) = \int_0^t f_b(\tau)\, f_c(t-\tau)\,d\tau. \tag{7}$$
0 


Our second major assumption concerns only the choice latencies 
and requires the distribution f to be composed from more elemen- 
tary distributions. The basic idea is that the final decision made 
by a person is organized into a set of simpler decisions which are, 
in some appropriate sense, elementary decisions built into him. If 
such a structure exists in human decision making, it is analogous 
to the structure of a decision process in a computing machine, 
which may be thought of as composed from a set of decisions 
which are elementary relative to that machine, i.e., the elementary 
decision capabilities built into the machine by the engineer. Them 


of the machine, and we shall suppose it is true of human beings. 
In addition, the breakdown of a complex decision is not, in general, 
restricted to a serial process where one elementary decision 
is followed by another, for in a machine different portions may be 
simultaneously employed on different parts of the problem. There 
seems every reason to suppose this is also true in a human being. 

We shall describe the organization of decisions by a directed 
graph. (The terms oriented graph and network have also been em- 
ployed in the mathematical literature and the term flow diagram is 
used in connection with computer coding.) A directed graph con- 
sists of a finite set of points which are called nodes, with directed 
lines between some pairs of them. Several examples are shown in 
Figure 1. It is possible, in general, for more than one directed 
line to connect two points, both in the sense that we may have two 
or more in the same direction as in Figure 2a, and in the sense 
that there may be lines with opposite directions as in Figure 2b. 
In this paper, when we use the term directed graph, we shall sup- 
pose that neither of these possibilities is allowed, that is, we 
shall suppose that between any pair of nodes there is at most one 
directed line. 

FIGURE 1 


decisions in the following way: At each node we shall assume that 
an "elementary decision" will take place, the latency distribution 
governing the decision at node i being denoted by $f_i$. The decision 
process is initiated at node i when, and only when, decisions have 
been made at each of those nodes j such that there is a directed 
line from j to i. We may think of the "demon" at node i waiting to 
begin making his decision until he has received the decisions of 


For the directed graphs we shall consider, there will be at least 
one node, possibly more, which is the terminal point of no line; 
these will be the decision points which are activated by the ex- 
perimental stimulus at time 0. There will also be at least one 
node, and again possibly more, which initiates no directed line, 
and it is only when the decisions at all these nodes have been 
taken that the motor actions, which signal the subject's response 
to the experimenter, are begun. It is clear that for any individual 
and for any stimulus situation it is possible to find at least one 
directed graph N and elementary latencies $f_i$ which compose as 
described above to give $f_c$. For example, let N have but one node 
and let $f_1 = f_c$. We shall, however, make stringent assumptions 
about N and $f_i$ which, in general, exclude this trivial solution. It 
is some of these assumptions which most likely will be abandoned 
or modified if the present model cannot cope with experimental 
data. 

FIGURE 2 
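
The way a directed graph composes elementary latencies into a choice latency can be sketched in Python. This is only an illustration under stated assumptions: the graph is given as a dictionary mapping each node to the list of nodes with a directed line into it, the graph has no cycles, and `draw` is a hypothetical callable that returns one elementary latency per call.

```python
import random


def choice_latency(preds, draw):
    """Time at which all nodes that initiate no directed line have decided.
    A node starts only when every node with a line into it has finished;
    nodes with no incoming line start at time 0."""
    finish = {}

    def done(i):
        if i not in finish:
            start = max((done(j) for j in preds[i]), default=0.0)
            finish[i] = start + draw()
        return finish[i]

    has_successor = {j for js in preds.values() for j in js}
    terminal = [i for i in preds if i not in has_successor]
    return max(done(i) for i in terminal)


serial = {0: [], 1: [0], 2: [1]}           # three decisions in a chain
parallel = {0: [], 1: [], 2: []}           # three simultaneous decisions
diamond = {0: [], 1: [0], 2: [0], 3: [1, 2]}

# One random sample with exponential elementary latencies (rate 2.0 is an
# arbitrary illustrative value).
rng = random.Random(0)
sample = choice_latency(diamond, lambda: rng.expovariate(2.0))
```

With a constant elementary latency of one time unit, the chain gives three units, the parallel graph gives one, and the diamond gives three, which matches the verbal description of the composition rule.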

Assumption II. It is possible to find for each stimulus situation, 
$\sigma$, a set of stimulus situations, S, which all have the same base- 
time distribution, $f_b$, and an elementary decision latency, $f_e$, such 
that: 

1. $\sigma$ is an element of S. 

2. For each choice situation $\rho$ in S there exists a directed graph 
$N_\rho$ with the properties, 

a. each of the latency distributions at the nodes is the same, 
namely, $f_e$, 

b. the decision time at node i is independent of that at node 
j, $i \ne j$, 

c. $f_c$ is a composition of $N_\rho$ and $f_e$ (as described above). 

3. Among the stimulus situations in S there is one whose di- 
rected graph satisfying conditions II.2 is a single point. 

In less formal terms, we require that there be groups of stimulus 
situations all of which have the same base-time distribution and 
which can be built up according to a directed graph from elemen- 
tary and independent decisions which all have the same latency 
distribution f,. In addition, among the stimulus situations in this 
class we assume that there is one which employs but a single ele- 
mentary decision. The latter assumption can be weakened, if we 
choose, to the assumption that there is one stimulus situation 
whose directed graph we know a priori, but in what follows we 
shall take the stronger form that the graph is a single point. 


V. Comments. The above assumptions comprise the formal struc- 
ture of our model; there are a series of auxiliary comments which 
are necessary. 




Even if we were able to show that these assumptions can be met 
for certain wide classes of experimental data, but that in so doing 
we obtain elementary decision distributions $f_e$ which are extremely 
complicated, it is doubtful that we should accept the model as an 
adequate description of the decision process. Equally well, if the 
directed graphs required are excessively complex we should reject 
the model. The hope is that it is possible to subdivide the total 
process into a relatively small set of subprocesses which are 
practically identical. But we do not want to be forced to an analy- 
sis in terms of individual neurone firings. It is probable that As- 
sumption II.3 effectively prevents this extremity by requiring the 
existence of a stimulus situation which demands but one elemen- 
tary decision for its response. 

It is also implicit in our thinking, although not a part of the 
formal model, that the sets S of "similar" stimulus situations will 
include as subsets those experimental situations we naturally 
think of as being similar. For example, suppose the subject is 
presented with n points, one of which is colored differently from 
the others and he is required to signal the location of that one. 
We should want to consider as "similar" the set of these situa- 
tions generated as n ranges over the smaller integers. We should 
probably reject the model if they could not be put in the same set 
S, even if by great ingenuity we were able to find other less in- 
tuitively simple sets of situations for which the model held. 

When the model is applied to experimental data we anticipate 
that the case of the directed graph being a single point will be 
identified with the intuitively "simplest" choice situation within 
the set of "similar" ones. 

In some of the following sections we shall make the following 
explicit assumption as to the form of $f_e$: 


$$f_e(t) = \begin{cases} \lambda e^{-\lambda t}, & t \ge 0, \\ 0, & t < 0. \end{cases}$$


There are two grounds for supporting this assumption. First, let us 
suppose that, if no decision has been reached by time t following 
the stimulus, the probability that the decision will be made in the 
interval from t to t + Δt, where Δt is small, is approximately 
proportional to Δt, with constant of proportionality $\lambda$. In this 
case it follows that the distribution of decisions 
is exponential (Christie, 1952a,b). Whether this assumption is 
correct is an empirical problem, but it must be admitted that it has 
the virtue of simplicity. Second, and probably more relevant, it is 
a relatively common observation that as certain decision situations 
are made more and more simple, the observed latency is better and 
better approximated by an exponential distribution slightly dis- 
placed from the origin (Christie, 1952b; Luce, 1953). The main 
error is generally on the rising limb. If this change toward sim- 
plicity is actually toward a directed graph consisting of one point, 
and if our other assumptions hold, then it seems plausible that the 
elementary decision latency is actually exponential but that the 
observed distribution is smeared by the convolution of the base- 
time distribution and the decision-time distribution. 


VI. The Problem. Let S be a set of choice situations which are 
presumed to satisfy the assumptions of the model, i.e., S is a set 
of the type described in Assumption II. Let $f_\sigma$ denote the reaction- 
time distribution associated with a typical member $\sigma$ of S. The 
problem is then to find distributions $f_b$ and $f_e$ and a set of directed 
graphs $N_\sigma$, where $\sigma$ ranges over S, such that each of the triples 
($f_b$, $f_e$, $N_\sigma$) when composed according to the assumptions of Section 
IV yields the distribution $f_\sigma$. There may, of course, be no, one, or 
many solutions to the problem, but one hopes that by an appropriate 
choice of S there will be exactly one solution. 

It would appear that if the problem is to be solved in any degree 
of generality, it must be attacked somewhat indirectly. It may 
prove appropriate to solve first the following problem: Given a 
continuous distribution f, find the set of all triples ($f_b$, $f_e$, N), 
where $f_b$ and $f_e$ are continuous, which satisfy the assumptions and 
which compose to form f. It seems very plausible to suppose that, 
in general, there are many solutions to this problem. However, if 
f and f′ are two distributions associated with choice situations 
from the same set S, then it will be necessary to accept only those 
triples with the same $f_b$ and $f_e$ present in both cases. Further 
stimulus situations should serve further to restrict the possibilities. 

These problems will not be attacked, let alone solved, in this 
paper; they appear to be of considerable difficulty. We know of 
only one important lead in this direction, but we have not investi- 
gated it. In recent years, electrical engineers have been concerned 
with the problem of synthesizing in a systematic manner electrical 
networks to have preassigned transfer functions. If we identify the 
given reaction-time distribution with the transfer function, the 
graph N with the electrical network, and $f_e$ with the component char- 
acteristic, there is an analogy between the two problems. This is 
probably worth investigation, but it is almost certain that solving 
our problem will prove to be a major research undertaking. 

To some extent the problem we pose may be simplified by using 
some of our assumptions and the Laplace transform. Let $f_\sigma$ be the 
observed distribution of reaction times for a given stimulus situa- 
tion $\sigma$; then by Assumption II we know there exists a set S which 
includes $\sigma$ and another stimulus situation whose directed graph 
consists of one point. Let $f_1$ denote the distribution of reaction 
times in the latter case. From Assumption I we may write 


$$f_\sigma(t) = \int_0^t f_b(\tau)\, f_c(t-\tau)\,d\tau, \qquad f_1(t) = \int_0^t f_b(\tau)\, f_e(t-\tau)\,d\tau. \tag{8}$$

Taking the Laplace transform in each case and applying equation 
(2), 

$$L(f_\sigma) = L(f_b)\,L(f_c), \qquad L(f_1) = L(f_b)\,L(f_e). \tag{9}$$

If we divide the first equation by the second in equation (9), we 
obtain 

$$\frac{L(f_\sigma)}{L(f_1)} = \frac{L(f_c)}{L(f_e)}. \tag{10}$$

This is a fairly crucial consequence of our assumptions, for it 
is seen that all mention of the base time has been eliminated. It 
is an equation relating the empirical data to $f_e$ and $N_\sigma$. 

At this point we should raise an important practical problem. 
Empirically, one does not obtain estimates of the distribution f, 
but rather approximations to the cumulative distribution 


$$F(t) = \int_0^t f(\tau)\,d\tau.$$

(Throughout we shall use small Latin letters to denote distribu- 
tions and the corresponding capitals to denote their cumulatives.) 
Now, while approximations to F may be reasonably accurate, it is 
well known that numerical differentiation of data tends to magnify 
errors and is, therefore, to be avoided. So the question arises 
whether we can translate our results, in particular equation (10), 
into statements about the cumulative distributions. From equation 
(3) we have 

$$L(f) = sL(F) - F(0).$$

Since we are speaking of empirical data we may assume F(0) = 0, 
and so equation (10) becomes 


$$\frac{L(F_\sigma)}{L(F_1)} = \frac{L(f_c)}{L(f_e)}. \tag{11}$$

Having eliminated $f_b$ from our discussion, the problem of deter- 
mining it remains. Since our division in equation (11) assumes 
$f_b$ is the same in the several cases, it will suffice to determine it 
from any one. The simplest, of course, is the case where the graph 
consists of one point, in which case 


$$L(f_b) = \frac{L(f_1)}{L(f_e)} = \frac{s\,L(F_1)}{L(f_e)}. \tag{12}$$


As an example of how equation (12) may be used, suppose $f_e$ is 
exponential with time constant $\lambda$. Then by equation (6), 


$$L(f_e) = \frac{1}{s/\lambda + 1},$$


and so equation (12) becomes 


$$L(f_b) = \frac{s}{\lambda}\,L(f_1) + L(f_1).$$


If we make the reasonable assumption that $f_1(0) = 0$, then from 
equations (3) and (5) we find 


$$L(f_b) = \frac{1}{\lambda}\,L\!\left(\frac{df_1}{dt}\right) + L(f_1) = L\!\left(\frac{1}{\lambda}\frac{df_1}{dt} + f_1\right).$$
Assuming that $f_b$ is continuous and that $f_1$ has a continuous de- 
rivative, equation (4) implies 


$$f_b = \frac{1}{\lambda}\frac{df_1}{dt} + f_1,$$


or integrating from 0 to t, 


$$F_b = \frac{1}{\lambda}\,f_1 + F_1. \tag{13}$$

Since $f_1$ must be determined from empirical data, it is clear from 
equation (13) that considerable data will be necessary to obtain 
accurate estimates of $F_b$. 
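
Equation (13), $F_b = f_1/\lambda + F_1$, can be checked on a case where everything is known in closed form. The Python sketch below assumes, purely for illustration, that the base time is itself exponential with rate MU (the paper makes no such assumption) and that the elementary latency is exponential with rate LAM; the one-node reaction-time distribution is then the convolution of the two.

```python
import math

LAM = 4.0   # rate of the elementary decision latency (illustrative value)
MU = 1.5    # rate of the hypothetical exponential base time


def f1(t):
    # Density of base time plus one elementary decision (convolution).
    return LAM * MU / (LAM - MU) * (math.exp(-MU * t) - math.exp(-LAM * t))


def F1(t):
    # Cumulative distribution corresponding to f1.
    return 1.0 - (LAM * math.exp(-MU * t) - MU * math.exp(-LAM * t)) / (LAM - MU)


def Fb_from_eq13(t):
    # Right side of equation (13).
    return f1(t) / LAM + F1(t)


def Fb_true(t):
    # Cumulative distribution of the assumed exponential base time.
    return 1.0 - math.exp(-MU * t)


grid = [0.1 * k for k in range(0, 51)]
worst = max(abs(Fb_from_eq13(t) - Fb_true(t)) for t in grid)
```

Here f1(0) = 0, as the derivation requires, and the two sides agree to rounding error. With empirical data the density $f_1$ would have to be estimated, which is where the need for considerable data arises.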


VII. Serial Decision Process. An alternative program to solving 
the general problem discussed in Section VI is to discover the 
consequences of certain explicit assumptions about the directed 
graph N and the elementary latency $f_e$. The results of this alterna- 
tive program will, unfortunately, be much weaker than a solution of 
the general problem, but they may have considerable heuristic 
value. We may choose such extra assumptions on intuitive grounds, 
with the hope that they may be relevant for some experimental 
data. We shall examine two cases which are, in a sense, the two 
most extreme forms of the directed graph N. The first, the topic of 
this section, is the general serial case shown in Figure 3a, and 
the second, which will be discussed in Section VIII, is the parallel 
case shown in Figure 3b. 


FIGURE 3 (a: serial case, a chain of nodes following the stimulus; b: parallel case) 
It follows immediately from Assumptions I and II.2.b that the 
observed distribution $f_n$ of a serial process having n nodes is given 
by 

$$f_n(t) = \int_0^t \int_0^{t_1} \cdots \int_0^{t_{n-1}} f_b(t_n)\, f_e(t_{n-1} - t_n) \cdots f_e(t - t_1)\,dt_n \cdots dt_1. \tag{14}$$


Applying the Laplace transform to equation (14) and using equation 
(2) we have 


$$L(f_n) = L(f_b)\,L(f_e)^n, \tag{15}$$

or dividing by the case n = 1, 

$$\frac{L(f_n)}{L(f_1)} = L(f_e)^{n-1} = \frac{L(F_n)}{L(F_1)}. \tag{16}$$


Equation (16) is the explicit form of equation (11) for the serial 
case. Clearly, if we have given numerical data we may determine 
(possibly numerically) $f_e$ for each value of n. 

As an example of how this might be done when we know the 
general form of $f_e$, suppose $f_e$ is exponential with the time constant 
$\lambda$. In that case, equation (16) becomes 


$$\frac{L(F_n)}{L(F_1)} = \frac{1}{(s/\lambda + 1)^{n-1}}. \tag{17}$$


In Figure 4 we have presented plots of $1/(s/\lambda + 1)^{n-1}$ vs. $s/\lambda$ for small 
values of n. 

A second equation may be obtained by observing that the mean, 
$\mu(n)$, of a serial process with n exponential elementary decisions 
is given by 

$$\mu(n) = \mu_b + \frac{n}{\lambda}, \tag{18}$$

where $\mu_b$ is the mean base time. Thus, 


$$\mu(n) - \mu(1) = \frac{n-1}{\lambda}. \tag{19}$$


We may now use equations (17) and (19) to attempt to decide 
whether a given set of data is adequately fit by the assumptions of 
the model, plus the added assumptions of a serial directed graph 
and exponential elementary latencies. There are serious statisti- 
cal questions as to how this may best be done, but the following 
ready method may suffice until the statistical problems are formu- 
lated and solved. From the data we compute $L(F_n)/L(F_1)$ as a function of 
s; this we may assume is in the form of a plot, which we shall call 
plot A. For each (reasonable) value of n and for some value of $s/\lambda$, 
say $s/\lambda = 1/2$, find in Figure 4 the corresponding value of $1/(s/\lambda + 1)^{n-1}$. 
We know from equation (17) that this must be equal to $L(F_n)/L(F_1)$ if our 
assumptions are correct and if the correct value of n has been 
chosen. We thus enter plot A at this point and determine the value 
of s. Since we selected $\lambda = 2s$, this determines $\lambda$. But equation 
(19) presents a relation between the observed means, $\lambda$, and n 
which will be satisfied if our assumptions are valid. We choose 
the value of n such that the error between the observed means [the 
left side of equation (19)] and $(n - 1)/\lambda$ is a minimum; this yields 
the best possible fit at the point $s/\lambda = 1/2$ for the model with the 
added assumptions of a serial graph and exponential $f_e$. Using 
these values of $\lambda$ and n, one may add the theoretical curve 
$1/(s/\lambda + 1)^{n-1}$ vs. s to plot A, and a comparison between the two 
curves will give some indication of the adequacy of the assump- 
tions. Clearly, a less subjective criterion of the quality of this 
fit is needed. 

FIGURE 4 
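
The ready method just described can be sketched in Python. The sketch assumes that plot A is available as a callable `ratio(s)` that is continuous and decreasing in s, that the observed difference of means is supplied as `mean_diff`, and that candidate values of n start at 2; the function names and the use of bisection to "enter plot A" are choices made for this illustration, not part of the paper.

```python
def solve_decreasing(g, target, lo=0.0, hi=1.0):
    """Find s >= 0 with g(s) = target for a continuous decreasing g."""
    while g(hi) > target:       # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):        # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fit_serial(ratio, mean_diff, candidates, x=0.5):
    """Return (n, lam) chosen by the ready method, entering at s/lam = x."""
    best = None
    for n in candidates:
        target = (x + 1.0) ** -(n - 1)        # equation (17) at s/lam = x
        s = solve_decreasing(ratio, target)   # read s off plot A
        lam = s / x                           # since s/lam = x was selected
        err = abs(mean_diff - (n - 1) / lam)  # discrepancy in equation (19)
        if best is None or err < best[0]:
            best = (err, n, lam)
    return best[1], best[2]


# Errorless illustrative "data" from a serial process with n = 4, lam = 3.
true_n, true_lam = 4, 3.0
plot_a = lambda s: (s / true_lam + 1.0) ** -(true_n - 1)
n_hat, lam_hat = fit_serial(plot_a, (true_n - 1) / true_lam, range(2, 8))
```

On errorless input the method recovers the generating values; with real data the minimum error will not be zero, and the subjective comparison of curves described above is still needed.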


VIII. Parallel Decision Process. If we suppose that the n elemen- 
tary decision processes are carried out in parallel (see Figure 3b), 
the choice latency distribution is the distribution of the largest of 
n selections, one from each of the elementary distributions. This 
is known to be given by 

$$\frac{d}{dt}\prod_{i=1}^{n} F_i(t),$$


which in the case when all the elementary distributions are the 
same, namely $F_e$, reduces to 


$$n\, f_e(t)\, F_e(t)^{n-1}.$$


If we denote the observed reaction-time distribution for the parallel 
case by $g_n$, then it follows from equation (7) that 


$$g_n(t) = \int_0^t f_b(\tau)\, n\, f_e(t-\tau)\, F_e(t-\tau)^{n-1}\,d\tau. \tag{20}$$


Applying the Laplace transform and equation (2), 

$$L(g_n) = L(f_b)\,L(n f_e F_e^{n-1}). \tag{21}$$


As before, we may divide by $L(g_1)$ to eliminate $L(f_b)$. 
To proceed further, we assume $f_e$ is exponential; then 


$$L(n f_e F_e^{n-1}) = n\lambda \int_0^\infty e^{-(s+\lambda)t}\,[1 - e^{-\lambda t}]^{n-1}\,dt$$

$$= n\lambda \int_0^\infty e^{-(s+\lambda)t} \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k e^{-k\lambda t}\,dt$$

$$= n \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k \frac{1}{s/\lambda + k + 1}.$$
k=0 5 k+l 
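To evaluate the above sum, consider the function 

$$\varphi(z) = n \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k z^{s/\lambda + k} = n\, z^{s/\lambda} (1 - z)^{n-1}.$$


Observe that 


$$\int_0^1 \varphi(z)\,dz = n \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k \int_0^1 z^{s/\lambda + k}\,dz = n \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k \frac{1}{s/\lambda + k + 1},$$


and that 


$$\int_0^1 \varphi(z)\,dz = n \int_0^1 z^{s/\lambda} (1 - z)^{n-1}\,dz = n\,B\!\left(\frac{s}{\lambda} + 1,\, n\right) = \frac{n\,\Gamma(s/\lambda + 1)\,\Gamma(n)}{\Gamma(s/\lambda + n + 1)},$$


where B(m, n) is the Beta function and $\Gamma(n)$ is the Gamma function. 
From these results we easily obtain 


$$\frac{L(g_n)}{L(g_1)} = \frac{n!\,\Gamma(s/\lambda + 2)}{\Gamma(s/\lambda + n + 1)}. \tag{22}$$


In Figure 4 we have also presented plots of $n!\,\Gamma(s/\lambda + 2)/\Gamma(s/\lambda + n + 1)$ vs. $s/\lambda$ 
for small values of n. 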
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The mean of the parallel process can be shown to be given by 


: Bel 3 টু 
Li(n) = hi (0) 5 ু and thus we have, as in the serial case, a 


= 
second relation which must be met 


n 


mn) -n(D-rD (93) 


i=2 
The procedure for curve fitting is the same as described for the 


serial case except that ড= 1 seems to be a more favorable place 


8 
to enter the graph than is = 5. 
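Equation (23) can be illustrated by simulation. This is a minimal sketch assuming exponential elementary latencies with rate `lam`; the function names, the trial count, and the seed are illustrative choices, not taken from the paper.

```python
import random


def harmonic(n: int) -> float:
    """Sum of 1/i for i = 1..n."""
    return sum(1.0 / i for i in range(1, n + 1))


def parallel_mean(n: int, lam: float) -> float:
    """Mean of the largest of n independent exponential latencies with rate lam."""
    return harmonic(n) / lam


def mean_difference(n: int, lam: float) -> float:
    """Right-hand side of equation (23): (1/lam) * sum_{i=2}^{n} 1/i."""
    return (harmonic(n) - 1.0) / lam


def simulated_parallel_mean(n: int, lam: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean of the slowest of n parallel decisions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(lam) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    print(parallel_mean(4, 2.0), simulated_parallel_mean(4, 2.0))
```

Because the base time is common to every n, it cancels in the difference of observed means, which is why only the harmonic terms from i = 2 onward appear in equation (23).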


IX. Model Selection. Without a solution to the general problem
described in Section VI, there arise statistical problems as to how
well a particular set of assumptions, such as a serial directed graph
and exponential $f_e$, fits the data and whether another set of similar
assumptions is better or not. In addition, within any one set of
assumptions there are undetermined constants, such as $\lambda$ and n,
and there is a question as to how best to choose them. We have
indicated one procedure (end of Section VI) to determine the
constants, but it is almost certain that such an ad hoc procedure is
not optimal.

The difficulty of making a selection among different sets of
assumptions is evidently quite serious, for it can be seen from Figure
4 that for almost any small value of n in one there is an n in the
other such that the two curves are fairly similar. Presumably, any
other directed graph will produce curves which, in some sense, lie
between these two extreme cases. Thus, the shape of the empirical
data curves will not be extremely revealing of the proper directed
graph to use, an unfortunate situation.

It is clear that there are a number of difficult statistical prob- 
lems here, but in all likelihood it will prove to be more efficient 
first to do some experimental exploring using subjective judgments 
as to goodness-of-fit before trying to formulate and to solve the 


statistical problems. 


X. The Perceptual Moment. In Section II we remarked that in
reaction-time studies the mean reaction time should be of the order
of one second if unwanted interactions with other stimuli are to be
avoided. This means that the data will be in a range where certain
peculiar phenomena have been observed. To explain these
observations, it has been proposed that a subject processes information
very rapidly at certain discrete times and that he is in a refractory
period between them. The period from the beginning of
one such hypothetical event to the beginning of the next has been
termed the perceptual moment (Stroud, 1949a,b). Unfortunately,
relatively little direct experimentation has been conducted on this
problem, and so it is not possible at this time to give a formal
statement of the properties of the moment. Indeed, there are
investigators who doubt its existence. In the case that it does exist,
our analysis will be applied to situations where it most probably
will have an effect. It is, therefore, of interest whether the analysis
can be adapted to cope with it. In this section we shall make
a simple hypothesis as to the nature of the moment, not with any
belief that it is correct, but only to indicate that the general
features of the analysis remain unchanged.

Let us assume the moment is of fixed duration, say $\delta$ seconds,
and that while a person may receive information at any time during
that period it will only serve as a stimulus at the end of the
period. Furthermore, we will assume that all intermediate (elementary)
decisions occur at multiples of $\delta$. Since we may assume that
there is no correlation between the stimulus presentation and the
timing of the moment, we may assume the stimulus is presented
according to a uniform distribution $h$ in the interval 0 to $\delta$. This
assumption may be inappropriate, for it may happen that a person
is only able to assimilate information during part of the moment;
we shall return to this point later.

The question now arises as to the discrete form we should
assume for the elementary decision process. In the continuous case
we took it to be exponential, and so we shall use the discrete
analogue. We assume that if no decision has been reached by the
ith moment following the presentation, i.e., at time $i\delta$, then the
probability of a decision in the ith moment is $\lambda\delta$. If we call the
probability of a response by the ith moment $P_i$, then

$$P_i = P_{i-1} + [1 - P_{i-1}]\,\lambda\delta. \tag{24}$$

With the initial condition $P_0 = 0$, the difference equation (24) is
solved by

$$P_i = 1 - (1 - \lambda\delta)^i.$$

The probability of a decision in the ith moment is obviously
$[1 - P_{i-1}]\,\lambda\delta$; hence, we have

$$\lambda\delta\,(1 - \lambda\delta)^{i-1}, \tag{25}$$

as our distribution $f_e$.
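The solution of the difference equation can be verified by direct iteration. The sketch below is illustrative only: the function names and the numerical values of λ and δ are assumptions, and it requires λδ ≤ 1 so that λδ is a probability. It also confirms that the distribution (25) sums to one and has mean latency 1/λ.

```python
def cumulative_by_recursion(lam: float, delta: float, moments: int) -> list[float]:
    """Iterate equation (24): P_i = P_{i-1} + (1 - P_{i-1}) * lam * delta, with P_0 = 0."""
    p = lam * delta
    values = [0.0]
    for _ in range(moments):
        values.append(values[-1] + (1.0 - values[-1]) * p)
    return values


def cumulative_closed_form(lam: float, delta: float, i: int) -> float:
    """Closed-form solution P_i = 1 - (1 - lam * delta)^i."""
    return 1.0 - (1.0 - lam * delta) ** i


def decision_pmf(lam: float, delta: float, i: int) -> float:
    """Equation (25): probability that the decision falls in the ith moment."""
    p = lam * delta
    return p * (1.0 - p) ** (i - 1)


def mean_latency(lam: float, delta: float, terms: int = 2000) -> float:
    """Truncated sum of (i * delta) * pmf(i); approaches 1 / lam."""
    return sum(i * delta * decision_pmf(lam, delta, i) for i in range(1, terms + 1))


if __name__ == "__main__":
    lam, delta = 2.0, 0.1  # illustrative values only
    print(cumulative_by_recursion(lam, delta, 5))
    print([cumulative_closed_form(lam, delta, i) for i in range(6)])
    print(mean_latency(lam, delta))
```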
If we replace this discrete distribution, equation (25), by a
continuous one $\Phi_\epsilon$ which has rectangles of width $\epsilon$ and height

$$\frac{\lambda\delta}{\epsilon}\,(1 - \lambda\delta)^{i-1}$$

centered about the point $i\delta$, then it is clear that in
the limit as $\epsilon \to 0$ this becomes the discrete distribution.


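Let the base-time distribution be denoted by $f_b$, as before; then
the observed data in the discrete serial case is given by

$$f_n(t) = \lim_{\epsilon \to 0} \int_0^t \int_0^{t_{n+1}} \cdots \int_0^{t_2} f_b(t_1)\, h(t_2 - t_1)\, \Phi_\epsilon(t_3 - t_2) \cdots \Phi_\epsilon(t - t_{n+1})\, dt_1 \cdots dt_{n+1}. \tag{26}$$

Applying the Laplace transform and using equation (2),

$$L(f_n) = \lim_{\epsilon \to 0} L(f_b)\, L(h)\, L(\Phi_\epsilon)^n = L(f_b)\, L(h) \left[\lim_{\epsilon \to 0} L(\Phi_\epsilon)\right]^n. \tag{27}$$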


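Observe,

$$
\begin{aligned}
L(\Phi_\epsilon) &= \int_0^\infty e^{-st}\, \Phi_\epsilon(t)\, dt \\
&= \sum_{i=1}^{\infty} \int_{i\delta - \epsilon/2}^{i\delta + \epsilon/2} e^{-st}\, \frac{\lambda\delta}{\epsilon}\,(1 - \lambda\delta)^{i-1}\, dt \\
&= \lambda\delta\, \frac{e^{s\epsilon/2} - e^{-s\epsilon/2}}{s\epsilon} \sum_{i=1}^{\infty} (1 - \lambda\delta)^{i-1} e^{-si\delta}.
\end{aligned}
$$

But

$$\lim_{\epsilon \to 0} \frac{e^{s\epsilon/2} - e^{-s\epsilon/2}}{s\epsilon} = 1,$$

so

$$\lim_{\epsilon \to 0} L(\Phi_\epsilon) = \frac{\lambda\delta\, e^{-s\delta}}{1 - (1 - \lambda\delta)\, e^{-s\delta}}.$$

Substituting in equation (27) and dividing by the case n = 1, we
have

$$\frac{L(f_n)}{L(f_1)} = \left[\frac{\lambda\delta\, e^{-s\delta}}{1 - (1 - \lambda\delta)\, e^{-s\delta}}\right]^{n-1}, \tag{28}$$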


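which is the crucial equation for the discrete serial case. The
mean of the discrete distribution $f_e$ is given by

$$-\frac{d}{ds}\left[\frac{\lambda\delta\, e^{-s\delta}}{1 - (1 - \lambda\delta)\, e^{-s\delta}}\right]_{s=0} = \sum_{i=1}^{\infty} i\delta\, \lambda\delta\,(1 - \lambda\delta)^{i-1} = \frac{1}{\lambda}. \tag{29}$$

Thus, the relation between observed means is

$$\mu(n) - \mu(1) = \frac{n - 1}{\lambda}. \tag{30}$$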


Now, if we know the value of $\delta$, i.e., the length of the moment,
then these two sets of equations may be used in exactly the same
fashion as were equations (17) and (19) of Section VII. We have
no theoretical value of $\delta$, so it will be necessary to perform
independent measurements of it. It is clear that if the perceptual
moment is a real phenomenon it will be important to ascertain its
properties prior to analyzing experiments on reaction time.
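The closed form used in equation (28) and the mean relation (30) can be checked numerically. This is a sketch under stated assumptions: the function names, the step size of the finite difference, and the values of λ and δ are illustrative, and λδ must not exceed 1.

```python
from math import exp


def transform_series(s: float, lam: float, delta: float, terms: int = 2000) -> float:
    """Laplace transform of the discrete distribution (25), summed term by term."""
    p = lam * delta
    return sum(p * (1.0 - p) ** (i - 1) * exp(-s * i * delta) for i in range(1, terms + 1))


def transform_closed(s: float, lam: float, delta: float) -> float:
    """Closed form: lam*delta*e^{-s*delta} / (1 - (1 - lam*delta) e^{-s*delta})."""
    p = lam * delta
    return p * exp(-s * delta) / (1.0 - (1.0 - p) * exp(-s * delta))


def ratio_28(s: float, lam: float, delta: float, n: int) -> float:
    """Right-hand side of equation (28)."""
    return transform_closed(s, lam, delta) ** (n - 1)


def mean_of_n_stages(lam: float, delta: float, n: int, h: float = 1e-6) -> float:
    """Mean of n serial stages, from -d/ds of the n-fold transform at s = 0
    (central finite difference)."""
    def f(s: float) -> float:
        return transform_closed(s, lam, delta) ** n
    return -(f(h) - f(-h)) / (2.0 * h)


if __name__ == "__main__":
    lam, delta = 2.0, 0.1  # illustrative values only
    print(transform_series(1.5, lam, delta), transform_closed(1.5, lam, delta))
    print(mean_of_n_stages(lam, delta, 3) - mean_of_n_stages(lam, delta, 1))  # close to (3 - 1) / lam
```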

One further comment of some interest: If we ignore $f_b$ and let
n = 1, the convolution of $h$ and $\Phi_\epsilon$, when $\epsilon \to 0$, is a step function
such as that shown in Figure 5. The convolution of this function
with $f_b$, for reasonable $f_b$, will serve to smear the steps but it will
not utterly destroy them. Smearing will also result if n is larger
than 1, the amount depending on the value of n. Thus, if our
assumption as to the moment is roughly correct, we should expect,
at least for comparatively simple situations, to find the observed
latency distribution somewhat lumpy. Indeed, in the literature (cf.
Woodworth, 1938) it has been remarked not only that the data are
lumpy but that there is an oscillation superimposed on the
distribution curve. This effect could easily be obtained analytically if we
were to assume $h$ uniform over only a small portion of the interval
0 to $\delta$, in other words, if we assume the vast majority of the
moment is truly a refractory period during which there is no intake of
information. These considerations bring out even more strongly the
need for comprehensive experiments to determine the properties of
the moment.

FIGURE 5

We shall not attempt, as before, to study the parallel case. The 
reasons are that the mathematical problem is rather complex and 
with so little information on the nature of the moment it hardly 
seems worthwhile to carry out the analysis. Furthermore, we are 
of the opinion that it is unlikely that information accepted in dif- 
ferent moments is dealt with other than serially. It may happen, 
however, that the information accepted in one moment is processed 
in parallel. The latter remark is a possible hint for developing an 
explanation of the effect of changing the number of "psychological
dimensions" in an information display.


XI. Experimental Proposals. The key assumption in our analysis
is that elementary decision processes can be found of such a sort
that complex decisions can be built up from them in a way which
leaves their characteristic $\lambda$ value invariant. One should like to
present experimental subjects with stimuli which vary in several
dimensions but for which decisions on each of the dimensions have
identical time characteristics. If one uses conceptually different
dimensions, we run into the difficulty of possibly introducing
several different $\lambda$ values. If we use several objects with the
same dimension relevant for each and with identical characteristics
in every other respect, we have the difficulty that the reception of
the stimulus may not be unitary, but broken down into several
parts separated by receptor orienting acts such as eye movements.
The first of the two following proposals suffers from the latter
difficulty; the second from the former.


1st Experiment: Digit Difference Perception


Stimuli: White 3" x 5" cards with a triple-spaced typed, horizontal 
row of vertically aligned pairs of digits, 0 and 1, on each. The 
number of pairs per card to vary from one to sixteen. On each card 
either one pair or no pairs will be unlike digits, i.e., (0,1) or (1,0); 
the remainder like pairs, i.e., (1,1) or (0,0). The place of the un- 
like pair in the series of pairs to vary from the initial to the final 
position. Cards with the unlike pair in each of the positions from 
one to n will be included in the set with equal frequency, and 
cards with no unlike pair will be included with the same frequency. 
The assignment of (1,1) or (0,0) to the remaining places will be 
made on an equiprobable random basis, and the choice of (0,1) or 
(1,0) for the unlike pair will be made on the same basis. 
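The construction of the stimulus cards can be sketched in code. This is only an illustration of the sampling scheme described above; the function names, the representation of a card as a list of digit pairs, and the number of copies per case are assumptions.

```python
import random


def make_card(n_pairs: int, unlike_position, rng: random.Random) -> list:
    """Build one card of n_pairs digit pairs.

    unlike_position is 1..n_pairs for the place of the single unlike pair,
    or None for a card with no unlike pair.
    """
    card = []
    for position in range(1, n_pairs + 1):
        if position == unlike_position:
            card.append(rng.choice([(0, 1), (1, 0)]))  # unlike pair, equiprobable
        else:
            card.append(rng.choice([(0, 0), (1, 1)]))  # like pair, equiprobable
    return card


def make_deck(n_pairs: int, copies: int, rng: random.Random) -> list:
    """Each unlike-pair position 1..n_pairs, and the 'no unlike pair' case,
    appears with the same frequency (copies times); the deck is then shuffled."""
    cases = list(range(1, n_pairs + 1)) + [None]
    deck = [(case, make_card(n_pairs, case, rng)) for case in cases for _ in range(copies)]
    rng.shuffle(deck)
    return deck


if __name__ == "__main__":
    for case, card in make_deck(4, 1, random.Random(0)):
        print(case, card)
```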


Responses: Experimenter will announce prior to each stimulus
presentation how many pairs the card to be shown bears. Subject
will respond yes or no, depending on whether the card does or does
not bear an unlike pair, by pressing the appropriate one of two
keys. The subject will be told that an unlike pair in each of the
possible positions, including in no position, are equally likely
events, and will be instructed to read the lines of pairs from left
to right. The data of primary interest will be the latencies of the
no response to the cards which bear no unlike pair and the
latencies of the yes response to the cards which bear an unlike pair
in the nth position.


Apparatus: 1. Stimulus cards as described above.

2. Light projector with fast shutter.

3. Three telegraph keys: (a) for the subjects to rest
their fingers on prior to response so that the
response will always start from the same situation,
(b) for yes responses, (c) for no responses.

4. A buzzer of ½ sec duration as a warning signal to
be sounded ending 1 sec before shutter opens to
illuminate stimulus.

5. Recording chronoscope accurate to at least ± 10
millisec.

6. Timer for ready signal and shutter operation with 
silent starting key for the experimenter. 


2nd Experiment: Multi-attribute Perception 


Stimuli: Ten decks of 32 cards each to be prepared using two 
values on each of five attributes according to the following scheme: 


Attribute                    Values
1. Number of spots           2; 3
2. Color of spots            Red; black
3. Shape of spots            Round; square
4. Arrangement of spots      Horizontal line; vertical line
5. Background color          White; green
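Two values on each of five attributes give 2^5 = 32 distinct cards, which accounts for the deck size. The following sketch enumerates such a deck and scores a yes/no response against an announced attribute pattern; the dictionary keys, the function names, and the spot counts 2 and 3 (read from the worked examples, which mention two and three spots) are assumptions.

```python
from itertools import product

# Attribute names and values follow the scheme above; the key spellings are illustrative.
ATTRIBUTES = {
    "number of spots": (2, 3),
    "color of spots": ("red", "black"),
    "shape of spots": ("round", "square"),
    "arrangement of spots": ("horizontal line", "vertical line"),
    "background color": ("white", "green"),
}


def build_deck() -> list:
    """One card for every combination of attribute values."""
    names = list(ATTRIBUTES)
    return [dict(zip(names, values)) for values in product(*ATTRIBUTES.values())]


def correct_response(instruction: dict, card: dict) -> str:
    """'Yes' when the card matches every attribute value named in the instruction."""
    return "Yes" if all(card[name] == value for name, value in instruction.items()) else "No"


if __name__ == "__main__":
    deck = build_deck()
    print(len(deck))
    print(correct_response({"shape of spots": "round", "color of spots": "red"}, deck[0]))
```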


Responses: Experimenter will announce what pattern of attributes 
is to be responded to positively prior to each stimulus presenta- 
tion. Subject to make a yes or no response by pressing the appro- 
priate one of two keys as exemplified below: 


Experimenter Says               Stimulus Presented               S to Respond

1. Round red                    Two black squares in             No
                                horizontal line on white
                                card
2. Vertical line of squares     Three red squares in             Yes
   on green card                vertical line on green
                                card


The instruction-stimulus pairs which call for a negative response 
should be half of the total number of stimuli presented in each 
attribute-pattern category so that the uncertainty of response prior 
to stimulus presentation will be equalized at the maximum. The
data of primary interest will be the latencies of response to the
set-stimulus pairs calling for a yes response.


Apparatus: Same as for the first experiment except for the stimulus
cards.


Introduction


There are two very striking characteristics of the field of psychoacoustics. One 
is the breadth and variety of research skills and techniques used to study hearing. The 
techniques range from hydrodynamic studies of the cochlea to analysis of the percep- 
tion of vowel forms. This multidisciplinary approach is a fortunate one since it reduces 
the chances that any really significant aspect of the sensory system is being overlooked. 
However, it creates a diversity which makes integration of these areas most difficult. 

A second characteristic of the field is the lack of any integrative structure from 
which to view the rapidly expanding experimental literature. If some basic theoretical 
structure existed, these new data might easily be integrated with the old. Psycho- 
acoustics, however, does not have any complete comprehensive theory. A reflection of 
this deficit is the lack of consensus on methodology. Often, even where a general 
consensus seems to exist in some area of the field, a new paper may force a complete 
re-examination of the entire measurement procedure. A recent example of the latter
may be found in the exchanges of Garner¹ and Stevens² on the quantitative scale of
loudness. Such a situation compounds the problem of integration.

* Editor's Note: This is the first of a series of tutorial papers on aspects of acoustics of
recent interest. Its publication is supported in part by a grant from the National Science
Foundation. Other papers of this kind will follow in subsequent issues.
† This paper was partially supported by the U.S. Air Force under contract, monitored
by the Operational Applications Office, Air Force Cambridge Research Center, Laurance G.
Hanscom Field, Bedford, Massachusetts, and administered by the Research Laboratory of
Electronics, M.I.T. This is Tech. Rept. No. AFCCDD TR-60-20.

This paper, therefore, makes no attempt at broad coverage. The author hopes 
that by concentrating on one rather limited topic some positive contribution can be 
made. This topic is the detection of signals in noise. In recent years a general theore- 
tical structure (detection theory) has been used to analyze such experiments. Un- 
fortunately, there appears to be some confusion both about the theory itself and the 
manner of its application. The main objective of this paper will be to clarify these two 
questions. Part of the confusion about the theory arises from the fact that detection 
theory is a combination of two distinct theoretical structures: decision theory and the 
theory of ideal observers. Before we begin a detailed discussion of these two aspects 
of detection theory, we will briefly outline them and relate them to psychoacoustic 
problems. 

Decision theory provides an analysis of the process which generates the dicho- 
tomy between stimuli the subject reports he does and does not hear. The theory 
recognizes that a priori probabilities, values, and costs of correct and incorrect decisions, 
as well as the physical parameters of the signal, play a decisive role in establishing this 
dichotomy. We will find that this dichotomy is determined by an adjustable criterion. 
The theory shows how a quantitative estimate of the criterion can be obtained from the 
data. 

There are many psychoacousticians whose only interest in this criterion is as a 
constant parameter from which to obtain substantive relations between two physical 
parameters, for example, the absolute threshold energy as a function of frequency, or 
the just detectable change in power as a function of power (ΔI vs I). To them this
aspect of detection theory will be of methodological interest only. Yet clearly, if 
factors such as a priori probability, values, and costs do play a role in determining the 
threshold, their control in substantive experiments is imperative. 

The second part of detection theory is more directly related to substantive 
matters—it is the theory of ideal observers. Briefly, the theory provides a collection of 
ideal mathematical models which relates the detectability of the signal to definite 
physical characteristics of the stimulus. There is a collection of such models because 
one may make different restrictions on the nature of the detection device. These 
theoretical observers are rarely used as actual models of the hearing mechanism. 
They are used for the sake of comparing human performance with that of
the ideal observer in order to specify the nature and amount of discrepancy. This
comparison suggests either a new and hopefully more accurate representation,
or new experiments to clarify further the exact nature of the
discrepancy. This will be illustrated in a later section of the paper.


Decision Theory 


We shall demonstrate, under quite general assumptions, how a transformation of
the data can be utilized to determine both the subject's criterion and the
detectability of the signal. This analysis requires an understanding of several basic
concepts, none of which is complex. We might skip over these fundamentals and start,
as some previous expositions have, with some assumptions about Gaussian distributions
and parameters of these distributions. Such a procedure would be unfortunate because 
it robs the analysis of its generality and implies that strong assumptions are needed to 
justify its applicability. Such is not the case. 

Typically, psychoacousticians try to analyze the subject's responses by making 
some assumptions about the way in which the sound is processed by the hearing 
mechanism. One assumes, for example, that the cochlea either makes a frequency analy- 
sis of the waveform or that it does not, etc. We wish to postpone temporarily such 
substantive issues. Let us, for the present, merely assume that each sound may be 
represented by a series of numbers. These numbers might be the values of a series of 
attributes, or various states of the nervous system. Whatever the representation, let 
us call this abstraction an observation. 

The problem we wish to consider is this: Given an observation, what response
alternative should be chosen? What is a good choice and how can we analyze these
choices? We shall attempt to answer these questions by considering a single example.
The example is obviously specific; the generality rests in the concepts. The single
motive in presenting this example is to enable us to discuss these concepts (likelihood
ratio, decision rule, and criterion) with some precision and yet avoid formalism.³
After this theoretical discussion, we shall investigate the applicability of these concepts
to a psychoacoustic experiment.


An example of decision theory 

Let us assume we have 10 observations, each observation ($X_i$) represented by
three numbers [$X_i = (x_1, x_2, x_3)$], and that we have two hypotheses, $H_1$, $H_2$, about the
observations. Given an observation, we wish to decide whether the observation is an
instance of $H_1$ or $H_2$. We shall assume we have complete information about the
probability of each observation given each hypothesis.

By limiting the example to 10 observations we can work with probabilities
directly. The reader should note that the three numbers ($x_1, x_2, x_3$) could have been
extended to three hundred. Everything that follows is independent of the
dimensionality of the observation. The variables ($x_i$) of the observation could be quantitative
(integers or real numbers) or qualitative (red, blue, or green). They are simply
descriptions of the observation.

3 These concepts come from the topic of statistical decision theory and the theory of
inference. Most of the key theorems were first presented by Wald, who extended the basic
principle which originated with Neyman and Pearson.

4 A. Wald, Statistical decision functions, New York: Wiley, 1950.

5 J. Neyman and E. S. Pearson, Phil. Trans. Roy. Soc. London, 1933, A231, 289.

6 For a concrete interpretation of the example, the reader might think of the observation
as a sealed package, the three numbers as the length, width, and depth of the package, and the
hypothesis as whether the package contains a toy car or animal. The problem, then, is this:
Given the measurements of a package, guess whether it contains a car or an animal.
Alternatively, one might think of the observation as a sound which can be specified by three numbers
or attributes. The problem is: Decide from the three numbers whether the sound is a
consonant or a vowel.

TABLE I

Description of the Observations ($X_i$) and the Probability of Obtaining
That Observation Given Either Hypothesis ($H_1$ or $H_2$).

                                                                   l(x1, x2, x3) =
Observation   x1   x2   x3   P_H1(x1, x2, x3)   P_H2(x1, x2, x3)   P_H1 / P_H2
X1             4    3    3         0.14               0.01            14.00
X2             ?    3    5         0.01               0.01             1.00
X3             ?    ?    4         0.03               0.30             0.10
X4             ?    3    3         0.30               0.10             3.00
X5             2    3    3         0.02               0.04             0.50
X6             5    ?    2         0.09               0.01             9.00
X7             2    ?    5         0.10               0.08             1.25
X8             3    4    5         0.20               0.05             4.00
X9             ?    ?    5         0.06               0.30             0.20
X10            4    2    5         0.05               0.10             0.50

Total                              1.00               1.00

Likelihood ratio. In Table I, we have listed the observations and the three
numbers corresponding to each observation. The next two columns provide the data
on the probabilities of each observation on each hypothesis. The final column is simply
the ratio of the fifth column to the sixth and represents the likelihood ratio. The
likelihood ratio, then, is the probability that a particular observation resulted from
$H_1$ divided by the probability that it resulted from $H_2$. The likelihood ratio gives what
some call the "odds." If we have ($X_6$) we should be willing to wager nine cents to one
that $H_1$ is correct. Note that the likelihood ratio is a number, not a probability, and
that this number is a function of three variables ($x_1, x_2, x_3$). Thus we have taken an
observation which is specified by three values ($x_1, x_2, x_3$), and related it to a single
variable $l(x_1, x_2, x_3)$.
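The final column of Table I can be recomputed from the two probability columns. This is a minimal sketch; the list and function names are illustrative, and the probabilities are copied from the table in the order $X_1$ to $X_{10}$.

```python
# Probabilities of observations X1..X10 under each hypothesis, copied from Table I.
P_H1 = [0.14, 0.01, 0.03, 0.30, 0.02, 0.09, 0.10, 0.20, 0.06, 0.05]
P_H2 = [0.01, 0.01, 0.30, 0.10, 0.04, 0.01, 0.08, 0.05, 0.30, 0.10]


def likelihood_ratios(p_h1: list, p_h2: list) -> list:
    """Likelihood ratio l(X) = P_H1(X) / P_H2(X) for each observation."""
    return [a / b for a, b in zip(p_h1, p_h2)]


if __name__ == "__main__":
    for index, ratio in enumerate(likelihood_ratios(P_H1, P_H2), start=1):
        print(f"X{index}: l = {ratio:.2f}")
```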

The reason we have performed this transformation is simply stated: We can
make optimum decisions if we use the likelihood ratio. We have not stated what we
mean by optimum, but let us take up this point a little later. First, let us show how we
might use the likelihood ratio in making decisions.


Decision rule. If someone asks us to make a decision about a particular
observation, whether it is an instance of $H_1$ or $H_2$, we would probably guess it was $H_1$
if the probability of that observation was greater on $H_1$ than on $H_2$. Such a statement
is called a decision rule. In terms of likelihood ratio this decision can be expressed as
follows: Choose $H_1$ if $l(X) > 1$. In effect, we have specified our decision rule by
choosing one number; in this case, the number "one." This number is called a criterion
or, more precisely, a likelihood-ratio criterion.

Suppose that, independent of any specific observation, $H_2$ was ten times as
likely as $H_1$. We would not maintain our previous criterion; even without
knowledge of the observation, the odds are ten to one in favor of $H_2$.
It is clear in this case that we should choose $H_1$ only if $l(X) > 10$. That is, we should
choose $H_1$ only if, in our example, the specific observation is $X = (4, 3, 3)$.
If we place asymmetrical values and costs on the various correct
and incorrect decisions, we should change our criterion or likelihood ratio accordingly.
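The effect of the criterion can be seen by applying the rule to the probabilities of Table I. The sketch below is illustrative only: the function names are hypothetical, and taking the criterion equal to the prior odds in favor of $H_2$ is the case discussed in the text, not a general prescription.

```python
# Probabilities of observations X1..X10 under each hypothesis, copied from Table I.
P_H1 = [0.14, 0.01, 0.03, 0.30, 0.02, 0.09, 0.10, 0.20, 0.06, 0.05]
P_H2 = [0.01, 0.01, 0.30, 0.10, 0.04, 0.01, 0.08, 0.05, 0.30, 0.10]


def decide(p_obs_h1: float, p_obs_h2: float, criterion: float = 1.0) -> str:
    """Choose H1 when the likelihood ratio exceeds the criterion, otherwise H2."""
    return "H1" if p_obs_h1 / p_obs_h2 > criterion else "H2"


def criterion_from_prior_odds(weight_h1: float, weight_h2: float) -> float:
    """Criterion equal to the prior odds in favor of H2 (ten to one gives 10)."""
    return weight_h2 / weight_h1


def chosen_h1(criterion: float) -> list:
    """Labels of the observations for which H1 is chosen at this criterion."""
    return [f"X{i}" for i, (a, b) in enumerate(zip(P_H1, P_H2), start=1)
            if decide(a, b, criterion) == "H1"]


if __name__ == "__main__":
    print(chosen_h1(1.0))
    print(chosen_h1(criterion_from_prior_odds(1.0, 10.0)))
```

With the criterion at one, $H_1$ is chosen for five of the ten observations; with the criterion at ten, only $X_1$ remains.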


DAVID M. GREEN


Monotonic functions of likelihood ratio. While we can state our decision
procedure in terms of likelihood ratio, there are other exactly equivalent ways of stating
the decision rules. In the example, it so happens that the product $x_1$ times $x_2$ minus $x_3$
is also an optimum decision quantity. This is true because this quantity is monotonic
with the likelihood ratio. The criterion number is not the same as that we would use
on a likelihood-ratio scale, but there is always some number on this monotonic scale
which corresponds to the criterion number on likelihood ratio. For example, suppose
we select the alternative $H_1$ if $l(x_1, x_2, x_3) > 1.25$; then we would make identical
decisions using the decision rule, select $H_1$ if $(x_1 \cdot x_2 - x_3) > 5.00$.

In many cases, such as the application of this theory to psychoacoustics, the 
decision axis is unobservable, and hence we are only interested in equivalent decision 
procedures. To say the observer uses an optimum decision procedure means only that 
he is using a monotonic transformation of likelihood ratio. 

Optimum nature of likelihood ratio. We turn now to the very important
question of the optimum nature of likelihood ratio. Clearly a decision procedure based on
likelihood ratio is only optimum if it best attains some specific objective. Let us list
some of these objectives to indicate their generality: (1) maximize the expected value
of decisions,⁷ (2) minimize risk, (3) estimate a posteriori probability,⁸ (4) maximize the
percentage of correct decisions, and (5) set the error rate on some decision
alternative at some constant and maximize the number of correct decisions for the other
alternative.⁹ The impressive fact is that a decision criterion based on likelihood ratio is
optimum under all the above objectives. Naturally this criterion may be different for
different objectives. The references listed with the objectives contain a more detailed
explanation of each objective and prove how a decision rule based on likelihood ratio,
or some monotonic transformation of that quantity, may be used to make the best
decisions.¹⁰
Distribution of likelihood ratio. We have seen how each observation,
independent of the number of attributes included in the observation, can be reduced to a single
quantity, likelihood ratio. Likelihood ratio is simply a function of several variables
and for any single observation is simply a number. We may then properly consider a
probability defined on the variable likelihood ratio. Let us consider, in particular, the
probability that we shall obtain a particular value of likelihood ratio under $H_1$ and $H_2$
of the preceding example. Table II shows these probabilities and the corresponding
cumulative distributions for both hypotheses of our example. The likelihood ratio is
ranked from largest to smallest to facilitate the explanation of the ROC curve.¹¹
7 W. W. Peterson, T. G. Birdsall, and W. C. Fox, Trans. IRE, 1954, PGIT-4, 171.

8 T. W. Anderson, An introduction to multivariate statistical analysis, New York:
Wiley, 1958.

9 P. M. Woodward, Probability and information theory with applications to radar,
New York: McGraw-Hill, 1955.

10 To estimate a posteriori probability no criterion is involved. In this case the best
estimate of a posteriori probability is a simple monotonic transformation of likelihood ratio.

11 Note that since two observations yield a likelihood ratio of 0.50, we have added the
probabilities under both hypotheses to obtain the probability of that likelihood ratio.

TABLE II

Probability under Each Hypothesis that
l(X) Will Have a Certain Value.

l(X)     P_H1[l(X)]   Cumulative   P_H2[l(X)]   Cumulative
14.00       0.14         0.14         0.01         0.01
 9.00       0.09         0.23         0.01         0.02
 4.00       0.20         0.43         0.05         0.07
 3.00       0.30         0.73         0.10         0.17
 1.25       0.10         0.83         0.08         0.25
 1.00       0.01         0.84         0.01         0.26
 0.50       0.07         0.91         0.14         0.40
 0.20       0.06         0.97         0.30         0.70
 0.10       0.03         1.00         0.30         1.00

ROC curves and their properties. We shall use Table II to construct an ROC
(Receiver Operating Characteristic) curve. To do this, let us assume the decision rule
is to accept $H_1$ if $l(x_1, x_2, x_3) \ge k$. If k = 14 we find that the probability of accepting
$H_1$ when it is true [$P_{H_1}(H_1)$] is 0.14 and the probability of accepting $H_1$ when it is false
[$P_{H_2}(H_1)$] is 0.01. By decreasing k, we change both probabilities. The upper curve
shown in Fig. 1 shows how the probabilities change as a function of k, and is called an
ROC curve. The two probabilities completely represent the stimulus-response matrix
in a two-alternative detection task since the complements of $P_{H_1}(H_1)$ and $P_{H_2}(H_1)$ are
the two remaining cells in the stimulus-response matrix.
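The ROC points of the example can be generated by sweeping the criterion k over the tabled values of likelihood ratio. This is a sketch under one stated assumption: $H_1$ is accepted when the likelihood ratio is at least k, which is the reading that yields 0.14 and 0.01 at k = 14. The names are illustrative.

```python
# Columns of Table II: likelihood ratio and its probability under each hypothesis.
LIKELIHOOD_RATIO = [14.0, 9.0, 4.0, 3.0, 1.25, 1.0, 0.5, 0.2, 0.1]
P_H1 = [0.14, 0.09, 0.20, 0.30, 0.10, 0.01, 0.07, 0.06, 0.03]
P_H2 = [0.01, 0.01, 0.05, 0.10, 0.08, 0.01, 0.14, 0.30, 0.30]


def roc_points(ratios: list, p_h1: list, p_h2: list) -> list:
    """For each criterion k, return (k, P_H2(H1), P_H1(H1)) under the rule
    'accept H1 if l(X) >= k' (assumed reading of the decision rule)."""
    points = []
    for k in ratios:
        hit = sum(a for l, a in zip(ratios, p_h1) if l >= k)
        false_alarm = sum(b for l, b in zip(ratios, p_h2) if l >= k)
        points.append((k, round(false_alarm, 2), round(hit, 2)))
    return points


if __name__ == "__main__":
    for k, false_alarm, hit in roc_points(LIKELIHOOD_RATIO, P_H1, P_H2):
        print(f"k = {k:5.2f}   P_H2(H1) = {false_alarm:.2f}   P_H1(H1) = {hit:.2f}")
```

The points reproduce the two cumulative columns of Table II.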


[Figure 1: plot of the ROC curve. The upper curve is labeled "Best possible performance"; the region between the upper and lower curves is labeled "Possible performance."]

FIGURE 1

The receiver operating characteristic (ROC) curve of the example. The axes are $P_{H_1}(H_1)$, which
is the probability of responding $H_1$ if the observation was from $H_1$, and $P_{H_2}(H_1)$, which is the
probability of responding $H_1$ if the observation was from $H_2$. The points were plotted from
Table II.

What if some decision procedure which is less than optimum were used? Let
us consider an extremely poor decision procedure. The lower curve of the figure was
generated by using the decision rule accepting $H_1$ if $l(x_1, x_2, x_3) < k$ for all k. This is
the exact opposite of the first decision rule and hence generates the ROC curve for the
worst possible decision rule.

The area included between the upper and lower bounds on performance
represents attainable performance using any decision procedure in this task. Obviously
any single decision is either right or wrong, but any decision rule whatever, in the long
run, will produce some probability of "hit" and some probability of "miss" which lie
within the bounds illustrated.¹² Other decision procedures do not necessarily involve
likelihood ratio. One procedure would be to flip a coin and select the first alternative
if the coin landed heads; if the coin were unbiased, this decision rule would achieve
an error and hit rate of 0.5. Should the coin be biased, this decision procedure would
produce performance located somewhere along the center diagonal of Fig. 1.

Another point to be noted involves the slope of the ROC curve based on the 
optimum decision axis. Notice that the slope between any two consecutive points is 
equal to the likelihood ratio of the higher point. Thus the slope must clearly diminish 
because each successive point represents a lower value for likelihood ratio. Any ROC 
curve which does not show a monotonically decreasing slope implies an incorrect 
decision rule. This means that the decision maker is accepting the first hypothesis when 
the likelihood exceeds a certain value and yet accepting the other hypothesis when 
likelihood ratio is some greater value. Any such inversion in slope for any ROC curve 
implies that better performance might be achieved by interchanging some of the points 
on the decision axis. 
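The slope property can be confirmed from the cumulative columns of Table II. In this sketch (names are illustrative) each segment slope is the increment in $P_{H_1}(H_1)$ divided by the increment in $P_{H_2}(H_1)$, starting from the origin.

```python
# Cumulative columns of Table II, ordered from the largest likelihood ratio to the smallest.
CUM_H1 = [0.14, 0.23, 0.43, 0.73, 0.83, 0.84, 0.91, 0.97, 1.00]
CUM_H2 = [0.01, 0.02, 0.07, 0.17, 0.25, 0.26, 0.40, 0.70, 1.00]


def segment_slopes(cum_hit: list, cum_false_alarm: list) -> list:
    """Slope of each ROC segment, with the curve starting at the origin."""
    hits = [0.0] + cum_hit
    false_alarms = [0.0] + cum_false_alarm
    return [
        (hits[i + 1] - hits[i]) / (false_alarms[i + 1] - false_alarms[i])
        for i in range(len(cum_hit))
    ]


def slopes_decrease(slopes: list) -> bool:
    """True when no segment is steeper than the one before it."""
    return all(a >= b for a, b in zip(slopes, slopes[1:]))


if __name__ == "__main__":
    rounded = [round(s, 2) for s in segment_slopes(CUM_H1, CUM_H2)]
    print(rounded, slopes_decrease(rounded))
```

The rounded slopes match the likelihood-ratio column of Table II, and they never increase from one segment to the next.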

ROC curve and percent correct using forced choice. The ROC curve is useful in 
a situation where the subject's response is limited to selecting one or the other alter- 
native. There are other ways in which the detection task may be structured; one 
involves the class of forced-choice procedures. For simplicity, we will consider a two- 
alternative forced-choice task. The extension to larger numbers of alternatives should 
be clear from the following discussion. A two alternative forced-choice procedure is 
one in which two stimuli are presented, one from each class, and the subject is asked, 
in effect, what was the order of the stimuli: $H_1H_2$ or $H_2H_1$?

We shall calculate the probability of a correct decision based on the following
rule: Select the alternative $H_1H_2$ if the likelihood ratio on the first observation is
greater than on the second. In effect, this rule says to pick the larger likelihood ratio
and say $H_1$ for that observation. The reason for considering only this particular
decision rule is that this assumption is often made in the analysis of forced-choice tests.13

Assuming the subject picks the larger of two likelihood ratios and says the


12 It should also be noted that the lines connecting the points in the ROC curve do in
fact represent attainable performance. For example, a point located midway between the
points (0.07, 0.43) and (0.17, 0.73) is attainable by using a mixed-decision procedure, where $H_1$ is
accepted if $l(X) > 3$, each alternative is selected half the time by some random procedure
if $l(X) = 3$, and $H_2$ is selected if $l(X) < 3$.

13 Were we to give a complete analysis of this situation we would first list all possible
stimulus pairs $(S_iS_j)$. Next we would consider the probabilities on the hypothesis that the pairs
represented instances of $H_1H_2$ or $H_2H_1$, compute a likelihood ratio, and, in fact, derive an ROC
curve based on these computations.
TABLE III

Calculation of the Probability of a Correct Response
in a Forced-Choice Test.

    $k$      $P_{H_1}[l(X) = k]$    $P_{H_2}[l(X) < k]$    Product
  14.00           0.14                   0.99             0.1386
   9.00           0.09                   0.98             0.0882
   4.00           0.20                   0.93             0.1860
   3.00           0.30                   0.83             0.2490
   1.25           0.10                   0.75             0.0750
   1.00           0.01                   0.74             0.0074
   0.50           0.07                   0.60             0.0420
   0.20           0.06                   0.30             0.0180
   0.10           0.03                   0.00             0.0000

                                                   Sum    0.8042


likelihood ratio was produced by $H_1$, we shall be correct if the larger likelihood was in
fact produced by $H_1$ and the smaller was in fact produced by $H_2$. The probability of
this occurrence is $P_{H_1}[l_i(X)] \cdot P_{H_2}[l_j(X)]$ where $l_i(X) > l_j(X)$. In fact, if the larger
likelihood ratio is equal to $k$, the probability of a correct choice is simply
$P_{H_1}[l(X) = k] \cdot P_{H_2}[l(X) < k]$.14 To obtain the final result we need only summate over all
the values of $k$, since any of these values might be the largest, except the lowest value
of likelihood ratio itself.
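The summation just described can be checked with a few lines of code. The following is a minimal sketch, assuming only the numbers printed in Table III; the function names are ours and purely illustrative. It also recovers the ROC points by accepting the first hypothesis whenever the likelihood ratio is at least $k$.

```python
# Likelihood-ratio values k and P_H1[l(X) = k], copied from Table III.
K = [14, 9, 4, 3, 1.25, 1.00, 0.50, 0.20, 0.10]
P_H1 = [0.14, 0.09, 0.20, 0.30, 0.10, 0.01, 0.07, 0.06, 0.03]
# P_H2[l(X) < k], also copied from Table III.
P_H2_BELOW = [0.99, 0.98, 0.93, 0.83, 0.75, 0.74, 0.60, 0.30, 0.00]


def forced_choice_correct(p_h1, p_h2_below):
    """Sum over k of P_H1[l = k] * P_H2[l < k]; ties earn no credit, as in the table."""
    return sum(p * q for p, q in zip(p_h1, p_h2_below))


def roc_points(p_h1, p_h2_below):
    """(false-alarm, hit) pairs obtained by accepting H1 whenever l(X) >= k."""
    points, hit = [], 0.0
    for p, below in zip(p_h1, p_h2_below):
        hit += p  # cumulative probability of l(X) >= k under H1
        points.append((round(1.0 - below, 4), round(hit, 4)))
    return points


if __name__ == "__main__":
    print(round(forced_choice_correct(P_H1, P_H2_BELOW), 4))
    for k, point in zip(K, roc_points(P_H1, P_H2_BELOW)):
        print(k, point)
```

Each term of the sum is an increment in hit probability multiplied by one minus the false-alarm probability at the same criterion, so the same function can be applied to an ROC curve estimated from data.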


Table III gives these calculations and the final answer (0.8042). While the
method of calculating this probability is straightforward, often, especially in psycho-
acoustic experiments, one does not have numerical distributions on a likelihood-ratio
scale.

Two approaches could be used in these situations. The first, and the safest,
since it makes no additional assumptions, would be to compute the probability from
an experimentally determined ROC curve. If you look at Table III closely, you will
see that the quantities used in the calculation are simply the increment in hit
probability, $\Delta P_{H_1}(H_1)$, times one minus the false-alarm probability, $[1 - P_{H_2}(H_1)]$,
for each successive point on the ROC curve (Fig. 1). Obviously, the accuracy of such a
procedure is heavily determined by the accuracy of the experimental estimate of the
ROC curve. The merit of the technique is that no assumptions beyond that of the
decision rule are necessary to predict forced-choice behavior from the ROC data.

A second procedure, one which has often been used, is to make some assumptions
about the distributions which generated the ROC curve and then use these assumptions
in predicting behavior in the forced-choice experiment. The most popular set of
assumptions is that the distribution of observations on the likelihood-ratio axis, or
some monotonic function of that axis, is normal or Gaussian under both hypotheses.
The distributions are assumed to differ only in their means and, sometimes, in their
standard deviations. Let us assume, for simplicity, that standard deviations are equal
under both hypotheses; then the ROC curve can be characterized by one parameter:


14 If more than two, say $M$, alternatives are used in the forced-choice test, the equation
becomes

$$P(\text{correct}) = \sum_k P_{H_1}[l(X) = k] \cdot \{P_{H_2}[l(X) < k]\}^{M-1}.$$
the difference in the means divided by the standard deviation ($\Delta M/\sigma$). This parameter
is usually denoted by $d' = \Delta M/\sigma$. The calculation of the probability of a correct
detection in a two-alternative forced-choice situation, if these assumptions are made, is
quite simple. The probability that one likelihood is larger than another is the probability
that the difference is greater than zero. Since, by assumption, some transformation
of $l(X)$ is normal, the difference distribution is normal with a mean of $\Delta M$ and a
variance equal to the sum of the original variances. Hence the probability of a correct
decision is

$$P(\text{correct, 2 alternative}) = \Phi[\Delta M/(\sigma^2 + \sigma^2)^{1/2}] = \Phi[d'/(2)^{1/2}],$$

where $\Phi$ is the cumulative normal distribution function.


The probability of being correct for any number of alternatives is given in footnote 
reference 15. 
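As an illustration of the Gaussian case, the sketch below evaluates the two-alternative expression and, for $M$ alternatives, the usual equal-variance integral $\int \phi(x - d')\,\Phi(x)^{M-1}\,dx$ by the trapezoidal rule. That integral is the standard textbook form and is our assumption here, since the tabled values themselves are not reproduced in this paper; the function names and the integration limits are likewise ours.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

UNIT = NormalDist()  # unit normal: mean 0, standard deviation 1


def p_correct_2afc(d_prime):
    """Two-alternative case: Phi(d' / sqrt(2))."""
    return UNIT.cdf(d_prime / sqrt(2.0))


def p_correct_mafc(d_prime, m, steps=4000):
    """Trapezoidal estimate of the integral of phi(x - d') * Phi(x)**(m - 1) dx.

    The signal observation x must exceed all m - 1 noise observations.
    Assumes d_prime >= 0 so that the integration range covers both densities.
    """
    lower, upper = -8.0, d_prime + 8.0
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lower + i * h
        density = exp(-0.5 * (x - d_prime) ** 2) / sqrt(2.0 * pi)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * density * UNIT.cdf(x) ** (m - 1)
    return total * h


if __name__ == "__main__":
    for m in (2, 3, 4, 6, 8):
        print(m, round(p_correct_mafc(1.0, m), 4))
```

For $M = 2$ the integral reduces to the closed form above, which gives a convenient check on the numerical routine.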

We have now reviewed all the essential aspects of how detection theory uses 
decision theory in analyzing the process of detection. Let us now turn to some experi- 
mental results and see to what extent these notions are supported. Following this 
review of the experimental studies, we shall conclude this section with a discussion of 
the implications of these studies for psychoacoustic procedures in general. 


Experimental results 

ROC curve. One of the earlier studies16 simply sought to determine experimentally
the shape of the ROC curve in a simple psychoacoustic task. The signal was
1/10 second of a 1000-cps sinusoid. White noise, the masking stimulus, was present
continuously throughout the experimental session. A light occurred to mark the
observation interval. During this interval either the signal was added to the noise (SN)
or simply the noise was presented (N); these were the two hypotheses of the detection
task. The subject gave one of two possible responses: he pressed one button if he
believed the signal was present (yes) or pressed a second button if he believed no
signal was present (no). The physical parameters of the situation, including noise
and signal levels, were held constant. The independent variable was the probability
(a priori) of a signal being present. Five levels of a priori probability were selected
(0.1, 0.3, 0.5, 0.7, 0.9) and the one used for a given session of 300 observations was
announced to the subject. After the subject responded, he was given immediate
information as to whether or not the signal had in fact been presented. The subject
was awarded some fraction of a cent for each correct answer and fined an equal amount
for each incorrect answer. He was instructed to make as much money as possible.

The results for one of the subjects are presented in Fig. 2. [$P_N(A)$ is the
probability of saying “yes” when noise alone was presented.] The general trend of the data
supports the decision-theory analysis. The curve drawn is generated by assuming the
distributions on likelihood ratio are normal under both hypotheses. The normalized
difference between the means is 0.92.

Threshold model and the ROC curve. Before considering whether or not the
subjects adopted the proper criterion so as actually to maximize their payoff, let us
consider one alternative explanation of the data. This is the so-called threshold model.

15 P. B. Elliott, Electronic Defense Group, University of Michigan, Technical Report
No. 97, 1959.

16 W. P. Tanner, J. A. Swets, and D. M. Green, Electronic Defense Group, University of
Michigan, Technical Report No. 30, 1956.
FIGURE 2
A sample of the ROC curve from an auditory detection experiment. See footnote 16. $P_N(A)$
is the probability of responding “yes” when noise alone was presented. $P_{SN}(A)$ is the
probability of saying “yes” when signal-plus-noise was presented. These probabilities were estimated
from the stimulus-response matrix. See text for details of the experiment.

The essentials of this model are that the signal, when added to the noise, augments some
process within the organism, such that if the increment reaches a critical level called
the threshold, the signal is heard and can be correctly detected. So far, we note no
great difference with the decision-theory analysis except in semantics. If one calls the
decision-theory criterion a threshold and the hypothetical process likelihood ratio,
the correspondence is complete. The differences between the models appear when one
considers “subthreshold” events and the procedures used to deal with these events.
The threshold model assumes that should the signal increment fail to reach the threshold,
the subject can only make a pure guess as to whether or not the signal is present.
This is surely true since anything below the threshold is just that. If ordering is
preserved below the threshold, the word has no meaning. The difference in terminology
between criterion and threshold is important, for to say the subject adopts a criterion
is to simply say an arbitrary cut point on a continuum is used as the decision rule.

Given that the subject guesses about events which are “subthreshold,” he may,
if blanks are ever employed, report the signal is present when it is not (false positive
response). Two techniques, both consistent with the threshold assumption, might be
employed if this occurs. One procedure widely used is to instruct the subject to be more
careful; this can be interpreted as an attempt to instruct the subject to respond
negatively to all “subthreshold” events. The implication of this procedure will be
discussed in a later section. Another procedure, equally valid from the assumptions of this
model, would be to employ a correction for guessing. This correction procedure
assumes the guessing mechanism and the sensory mechanisms are independent. The
excellent experiments of Smith and Wilson17 were the first, I believe, to show the
inadequacy of this second procedure. This fact led them to reconsider the entire notion

17 M. Smith and E. A. Wilson, Psychol. Monogr., 1953, 67, Whole No. 359.
of the threshold and they presented, as an alternative model, one very similar to that
suggested by decision-theory analysis. (See especially Sec. IV, footnote 17.) Munson
and Karlin,18 using an information-theory analysis, investigated the detection process
under “absolute threshold conditions.” In order to deal with false positive responses,
they proposed a “discriminant level model.” This model is also very similar to that
suggested by decision-theory analysis.

The threshold model could still attempt to account for the data shown in Fig. 2.
The argument would run as follows: Suppose the subject achieves some hit and false-
alarm rate. If the situation is changed in some way, he can modify his behavior by
simply giving more “yes” responses. Since this guessing rate is independent of the
stimulus conditions (both noise and signal-plus-noise events are below the threshold)
this will increase, by the same relative amounts, both the hit and false-alarm rates.
In short, a linear function will result. In the extreme, the subject says “yes” all the time,
hence this linear function must go through the point in the upper right-hand corner
[$P_N(A) = 1.00$, $P_{SN}(A) = 1.00$]. Thus the threshold prediction for the data is a collection
of lines having the upper right-hand corner as the common intercept, and a slope
depending upon the detectability of the signal. No linear function which has this
intercept as one value can fit more than a few of the data points for any value of the
slope. The results of this first experiment, then, seriously conflict with this version of
the threshold model and give some measure of support to the decision-theory analysis.
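The contrast between the two predictions can be sketched numerically. In the code below the threshold line is written as hit = q + (1 - q) × false alarm, where q, the hit rate at zero false alarms, is a hypothetical parameter introduced only for this illustration; the curved prediction assumes equal-variance normal distributions separated by the value 0.92 quoted above for Fig. 2.

```python
from statistics import NormalDist

UNIT = NormalDist()  # unit normal distribution


def threshold_roc(false_alarm, q):
    """Threshold prediction: a straight line from (0, q) to the corner (1, 1)."""
    return q + (1.0 - q) * false_alarm


def gaussian_roc(false_alarm, d_prime):
    """Equal-variance normal prediction of the hit rate at a given false-alarm rate."""
    criterion = UNIT.inv_cdf(1.0 - false_alarm)  # cut on the noise distribution
    return 1.0 - UNIT.cdf(criterion - d_prime)


if __name__ == "__main__":
    # q = 0.3 is an arbitrary illustrative value, not an estimate from the data.
    for fa in (0.05, 0.1, 0.3, 0.5, 0.7, 0.9):
        print(fa, round(threshold_roc(fa, 0.3), 3), round(gaussian_roc(fa, 0.92), 3))
```

Whatever value of q is chosen, the line can cross the curved function at only a few points, which is the sense in which no single slope fits the whole set of data.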

The conflict between some version of the threshold model and the decision 
analysis has been the subject of considerable experimental effort. There are other 
experimental results more damaging to the threshold position. These experiments 
attack the threshold concept directly because they suggest that ordering below the 
threshold value is indeed possible.19 We shall drop this conflict and proceed to other
questions. 

Actual criterion and optimum criterion. Let us now return to the results
displayed in Fig. 2 and discuss the question of the optimum criterion. It turns out that if
one wishes to select an optimum criterion on likelihood ratio, it is equal to $\beta =
P(N)/P(SN)$, where $\beta$ is the criterion value on likelihood ratio and $P(N)$ and $P(SN)$ are
the a priori probabilities of noise alone and signal-plus-noise, respectively. We can,
of course, obtain a rough measure of the subject's criterion by measuring the slope of
the ROC curve at the point nearest the experimental data point. This rough comparison
is displayed in Fig. 3. Note that while there is a strong relation between the
estimated and optimal criterion values, there is also a consistent departure from an
exact correspondence. The general trend might be summarized by saying the subjects
are conservative: they tend to adopt criteria which are not as different from $\beta = 1$
as they should be. This result is almost an inevitable consequence of the procedure.
The way in which expected values change for various criterion levels is the crux of the
problem. This topic is discussed in more detail in Appendix A.
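For the five a priori probabilities used in the experiment, the optimum criterion and the operating point it implies can be computed directly. This is a sketch under the equal-variance normal assumption with noise mean 0 and signal-plus-noise mean $d'$, for which the likelihood ratio at observation $x$ is $\exp(d'x - d'^2/2)$; the function names are ours, and the equal award and fine of the experiment are taken for granted.

```python
from math import log
from statistics import NormalDist

UNIT = NormalDist()  # unit normal distribution


def optimum_beta(p_signal):
    """Optimum likelihood-ratio criterion P(N) / P(SN) for symmetric payoffs."""
    return (1.0 - p_signal) / p_signal


def operating_point(beta, d_prime):
    """(false-alarm, hit) when "yes" is given whenever likelihood ratio exceeds beta."""
    # Solve exp(d' * x - d'**2 / 2) = beta for the cut point x.
    cut = log(beta) / d_prime + d_prime / 2.0
    return 1.0 - UNIT.cdf(cut), 1.0 - UNIT.cdf(cut - d_prime)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        beta = optimum_beta(p)
        fa, hit = operating_point(beta, 0.92)
        print(p, round(beta, 3), round(fa, 3), round(hit, 3))
```

The slope of the normal ROC curve at each of these points equals the corresponding $\beta$, which is why a slope read from the fitted curve serves as an estimate of the subject's criterion.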

Since these earlier investigations, other procedures have been utilized to vary the
subject's criterion. One which seems more straightforward and is certainly successful
is simply to instruct the subject verbally to adopt different criteria such as lax or very
strict, or even to instruct the subject to maintain a certain value for $P_N(A)$.


18 W. A. Munson and J. E. Karlin, J. Acoust. Soc. Am., 1954, 26, 542.
19 J. P. Egan, A. I. Schulman, and G. Z. Greenberg, J. Acoust. Soc. Am., 1959, 31, 768.
FIGURE 3
Comparison of the optimum and obtained criterion levels. This criterion level, $\beta$, is the
equivalent of the criterion level on likelihood ratio. The optimum criterion is obtained by
assuming normal statistics for both hypotheses. It is equal to $[1 - P(SN)]/P(SN)$, where
$P(SN)$ is the a priori probability of the signal.


Measure of detectability. Let us turn now from the question of the criterion 
and its adjustment to another aspect of detection-theory analysis, the measure of 
detectability, and more specifically, whether or not this measure remains relatively 
invariant over different experimental procedures. How one can compare different
measurements obtained using different experimental procedures is an important question,
not only for psychoacousticians but for any scientific enterprise. Let us review
the evidence on the extent to which detection-theory analysis has permitted such a 
comparison. If we make the usual assumption that the distribution of likelihood is 
normal with equal variance on both hypotheses, as in the situation outlined in the first 
experiment, then the measure of detectability is d’. 

A paper by Swets20 has considered the applicability of this detectability index
for yes-no and forced-choice procedures; he has also compared predicted and obtained
results using two, three, four, six, and eight alternatives in the forced-choice procedure.
In general, these predictions based on d’ hold up remarkably well. The worst failure
reported seems to be about 1 db; no consistent error trend is evident in the data.

Another method of generating ROC curves, first suggested by Swets et al.,21 has
been employed. Egan et al. tested and compared this method with the standard
yes-no procedure. In the single observation or yes-no procedure, the decision-theory
analysis claims that the subject adopts a single criterion and this determines a “yes”
or “no” response. The experimenter, then, is employing the subject as a threshold
device. Alternatively, the experimenter could have the subject report a number after
each observation such as likelihood ratio; from these numbers, the experimenter could
construct an ROC curve by placing various criteria on the likelihood ratios reported.

The rating procedure is a compromise between these two extremes. The subject
in the rating procedure is asked to place each observation in one of several categories;


20 J. A. Swets, J. Acoust. Soc. Am., 1959, 31, 511.

21 J. A. Swets, W. P. Tanner, and T. G. Birdsall, Electronic Defense Group, University
of Michigan, Technical Report No. 40, 1955.
the top one being used for sureness of a signal's presence, the next for a lesser degree of
sureness, and so forth. ROC curves are subsequently constructed. One can then compare
the measure of signal detectability obtained from these two procedures, yes-no
and rating. Egan et al. found these two measures differed for their three subjects by
0.3, 0.4, and 0.1 db, differences probably well within the experimental error.
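The construction of an ROC curve from ratings amounts to cumulating the category counts from the strictest category downward. The sketch below shows the bookkeeping; the counts are invented for illustration and are not data from any of the studies cited, and the function name is ours.

```python
def rating_roc(noise_counts, signal_counts):
    """ROC points from rating counts listed from "sure signal" to "sure noise".

    Each boundary between adjacent categories is treated as one criterion, so
    c categories give c - 1 points (false-alarm rate, hit rate).
    """
    noise_total = sum(noise_counts)
    signal_total = sum(signal_counts)
    points, noise_cum, signal_cum = [], 0, 0
    for n, s in zip(noise_counts[:-1], signal_counts[:-1]):
        noise_cum += n    # noise trials rated at or above this category
        signal_cum += s   # signal trials rated at or above this category
        points.append((noise_cum / noise_total, signal_cum / signal_total))
    return points


if __name__ == "__main__":
    # Hypothetical counts for four rating categories, 100 trials of each kind.
    noise = [5, 15, 30, 50]
    signal = [40, 30, 20, 10]
    for point in rating_roc(noise, signal):
        print(point)
```

Four categories yield three points on the curve from a single session, where the yes-no procedure would need three separate criterion conditions.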

In summary then, we have seen how decision analysis allows one to predict 
within a fairly wide range of psychoacoustic procedures. The forced-choice procedures 
using two to eight alternatives and a single-interval procedure using two to four 
categories of response can be summarized by a single measure of detectability, a 
measure which, for practical purposes, is invariant. 

Implications for psychoacoustic methods. The more traditional methods of
psychoacoustics utilize some parameter of the signal such as the threshold energy.
This value is obtained by an analysis of the subject's responses. Many of these methods
do not allow one to determine directly the subject's criterion and in most methods it
is presumed to be constant.

Let us investigate how variation in the subject's criterion, if it occurs, will affect
the estimate of the threshold energy. Variation of the subject's criterion affects the
false-alarm rate $P_N(A)$. Figure 4 shows how the probability distribution for signal-
plus-noise must be varied as the false-alarm rate $P_N(A)$ is changed to maintain a
constant value of signal detection $P_{SN}(A)$. We have assumed Gaussian distribution and
equal variance to construct the solid line of the figure. The insert displays the essentials
of the calculations and shows how a change in $P_N(A)$ from 0.10 to 0.01 necessitates
a change in the mean of the signal distribution from 1.3 to 3.11 in order to maintain
$P_{SN}(A) = 0.50$. This value of $P_{SN}(A)$ is a reasonable one since it is often used as the
estimate of “threshold.” Very small values of false-alarm rate were used because most
methods control this parameter to the extent of keeping it very low.

FIGURE 4
Evaluation of how a change in criterion will influence the size of the “threshold” signal.
$P_N(A)$ is the false-alarm rate; a “yes” response to no signal. The hit rate, $P_{SN}(A)$, was held
constant at 0.5. The mean of the signal distribution was varied (see insert) to achieve this hit
rate for various values of $P_N(A)$. The constant, $C$, was chosen so that $10 \log 1.3 + C = 0$.

We cannot say generally how this change in the mean of the signal distribution 
is related to any signal parameter. However, for sinusoidal signals in noise, d’ is
roughly proportional to signal energy; thus the “estimated threshold” may vary over a
6-db range depending on the criterion of the subject. (In other experiments d’
varies with signal voltage; hence the range might be 12 db. See Fig. 7 and the
discussion.)
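The dependence of the estimated threshold on the false-alarm rate can be sketched under the same Gaussian, equal-variance assumption. The code below finds the signal mean needed to hold the hit rate at 0.5 and expresses it in decibels relative to the mean required at a false-alarm rate of 0.10. The function names and the `db_factor` parameter are ours (10 when d’ is taken as proportional to energy, 20 when proportional to voltage), and the values it produces come from the normal distribution alone, so they need not agree exactly with values read from the figure.

```python
from math import log10
from statistics import NormalDist

UNIT = NormalDist()  # noise distribution: mean 0, standard deviation 1


def required_mean(false_alarm, hit=0.5):
    """Mean of the unit-variance signal distribution that yields the given hit
    rate when the criterion is set to produce the given false-alarm rate."""
    criterion = UNIT.inv_cdf(1.0 - false_alarm)
    return criterion - UNIT.inv_cdf(1.0 - hit)


def shift_db(false_alarm, reference=0.10, db_factor=10.0):
    """Change in the estimated threshold, in db, relative to the reference rate."""
    return db_factor * log10(required_mean(false_alarm) / required_mean(reference))


if __name__ == "__main__":
    for fa in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
        print(fa, round(required_mean(fa), 2), round(shift_db(fa), 2))
```

At a false-alarm rate of 0.10 the required mean is about 1.28, in line with the value of 1.3 used to fix the constant in Fig. 4.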

This change in the estimated threshold, of say 6 db, will only occur if the
subject's criterion changes. One may be willing to assume that it is approximately
constant over the course of the experiment.22 Then this number, 6 db, could be interpreted
as a tolerable difference in comparing two sets of different measurements. The theory,
then, is consistent with the rather widespread view in psychoacoustics; namely, that
results obtained using different methods should not be expected to show exact con- 
gruence. Whether these differences are large enough to warrant concern depends both 
on the particular nature of the problem and the precision desired. 

Decision analysis and speech research. The use of ROC curves and the measure 
d’ has not been limited to detection experiments. Since some confusion has been 
generated by the multiplicity of d’ measures, this issue deserves some attention. 

Figure 5 displays an ROC curve taken from a report by Egan.23 The similarity
between this figure and Fig. 2 is apparent, even though measures employed to construct
this graph differ greatly. The procedure here is as follows: A word is presented in
noise to a listener who writes down the word he thinks was presented. He then checks
whether or not he believes this identification response is correct. The conditional
probabilities of the receiver saying he was correct on those words where he in fact was,
and was not correct, define the ordinate and abscissa respectively of Fig. 5.

Egan's ROC curve, then, is constructed from a table of response-response
contingencies rather than from stimulus-response contingencies, as was the ROC curve
presented earlier. This difference, from the standpoint of analysis, is by no means
trivial. The method used by Egan is really a two-stage decision process. First, the
observer has to select (from several possibilities) the most likely word; second, he must
evaluate this decision with respect to all other possibilities. Such a process produces
mathematical expressions virtually impossible to evaluate except under the most
doubtful set of simplifying assumptions.

This difficulty does not, of course, prevent one from summarizing the data
presented in Fig. 5 by a single parameter. The line drawn to the data points is that
generated by moving a criterion along two normal deviates of the same variance which
differ only in means. This measure was, unfortunately, initially labeled d’ because of its
analogy to the detection measure. It is unfortunate because the detection measure d’
has often been specifically related to physical measurements of signal and noise. No


22 Obviously one can only assume it is constant because one cannot directly measure
probabilities of the order 10 2. If one is not willing to make this assumption, one must raise the
false-alarm rate to a measurable value, $P_N(A)$ > 107, or use one of the other techniques
discussed in the previous section. The signal energy necessary to obtain a certain d’, say
d’ = 1, could then be used as the counterpart of the threshold energy.

23 J. P. Egan, Hearing and Communication Laboratory, Indiana University, Technical
Report under contract, 1957.
FIGURE 5
Some data taken from footnote 23. The signal-to-noise ratio refers to the peak signal power
of the word compared with the noise power. The points represent different subjects. The
subject listens to a word in noise, guesses what word it was, and then grades that response as
either being correct (acceptance) or incorrect (rejection). The ordinate and abscissa refer to
the probability of acceptance given the word was correctly or incorrectly identified, respectively.


such identification was ever intended in speech work, and therefore these measures
obtained in speech research are presently denoted by various subscripts.24

The importance and usefulness of such measures is reviewed thoroughly in the
monograph by Egan23 and in the work of Pollack.25-28 Basically, these measures are
all aimed at specifying the subject's criterion. For an interesting example of how this
value of the criterion affects the substantive conclusion one might draw, the paper by
Pollack is recommended. A recent paper by Clarke29 has illustrated how confidence
ratings may be utilized to supplement the usual articulation index.


24 As yet, no standard notation has evolved. The following list of references contains
many of the proposals that have been advanced to clarify this confusion. At present, one must
very carefully determine how the detectability measure is defined in each experiment. Even
subscripted measures, d,’ in particular, are defined differently in different experiments. See
F. R. Clarke, T. G. Birdsall, and W. P. Tanner, J. Acoust. Soc. Am., 1959, 31, 629; J. P. Egan,
G. Z. Greenberg, and A. Schulman, Hearing and Communication Laboratory, Indiana
University, Technical Report under contract, 1959; and I. Pollack, J. Acoust. Soc. Am.,
1959, 31, 1031.

25 I. Pollack, J. Acoust. Soc. Am., 1959, 31, 1500.

26 L. R. Decker and I. Pollack, J. Acoust. Soc. Am., 1959, 31, 1327.

27 I. Pollack and L. R. Decker, J. Acoust. Soc. Am., 1958, 30, 286.

28 I. Pollack, J. Acoust. Soc. Am., 1959, 31, 1509.

29 F. R. Clarke, J. Acoust. Soc. Am., 1960, 32, 35.
Theory of Ideal Observers


In the most general sense, an ideal observer is simply a function relating an
observation to the likelihood of that observation. Thus we have already specified an
ideal observer for our simple example, since Table I accomplishes this task. This is not
an interesting example, however, because the observations were already specified in terms
of the probabilities under each hypothesis. A more interesting example of an ideal
observer arises where the observations are waveforms and where the characteristics of
the waveform differ under each hypothesis. The task of the ideal observer is, then, given
a waveform, calculate likelihood ratio or some monotonic transformation of that
quantity.

The ideal observer, strictly speaking, need not make any decisions. If likelihood 
ratio is computed, the problem of what decision rule to employ is determined by the 
specific objective in making the decisions. Various possible objectives have been
discussed in the previous sections, where it was pointed out that these objectives could be
attained by using a decision rule based on likelihood ratio. Although the calculation of
likelihood ratio specifies the ideal observer for a given problem, such information is of
little value unless we can evaluate this observer's performance. One general method of
evaluating the ideal observer's performance is to determine ROC curves, but to obtain
an ROC curve we must calculate two probabilities. Thus to evaluate completely the
ideal observer we actually have to specify not only how likelihood is calculated but the
probability distribution of likelihood ratio on both hypotheses.

Having established the general background of this problem, let us consider a
specific example: the ideal observer for conditions of a signal which is known exactly.


Ideal observer for the signal known exactly (SKE) 


Two hypotheses actually define this special case in which, given a waveform,
one must select one of the following hypotheses:

$H_1$: the waveform is a sample of white Gaussian noise $n(t)$ with specified
bandwidth ($W$) and noise power density ($N_0$).

$H_2$: the waveform is $n(t)$ plus some specified signal waveform $s(t)$. Everything
is known about $s(t)$ if it occurs: its starting time, duration, and phase. It
need not be a segment of a sine wave as long as it is specified, i.e., known
exactly.


From these two hypotheses we wish to calculate likelihood ratio, and, if possible, derive
the probability distribution of likelihood ratio on both hypotheses. Obviously such
calculations will be of little use unless the final results can be fairly simply summarized
in terms of some simple physical measurement of signal and noise. Happily, such is the
case.
Representation of the waveform

The assumption concerns the representation of the waveform. In order to compute
likelihood ratio, one must find the probability of a certain waveform on each
hypothesis. Since the waveform is simply a function of time, one must somehow associate
a probability with this waveform, or somehow obtain a set of measures from the waveform
and associate a probability with these measures.

But what exactly is the nature of the waveform? In order to compute these
various probabilities we must make some very specific assumptions about the class
of waveforms we will consider.


Peterson, Birdsall, and Fox7 assumed that the waveforms were Fourier series-
band limited. If the waveform is of this class it can be represented by $n = 2WT$
measures, where $W$ is the bandwidth of the noise and $T$ is the duration of the waveform.
A series representation in terms of sine and cosine might be used. There are,
of course, many equivalent ways of writing this series to identify the $n$ parameters, but
these are all unique, and if the original waveform is indeed Fourier series-band limited,
they will reproduce exactly the waveform in the interval $(0, T)$. Accepting this assumption,
we find that a monotonic transformation of likelihood ratio (the logarithm) is
normal under both hypotheses.

$H_1$: $\log l(X)$ is normal with mean $-E/N_0$, variance $2E/N_0$
$H_2$: $\log l(X)$ is normal with mean $+E/N_0$, variance $2E/N_0$

$d' = \Delta M/\sigma = (2E/N_0)^{1/2}$, where $E$ is the signal energy, $E = \int_0^T [s(t)]^2\,dt$, and $N_0$ is the noise
power density. Naturally, if this assumption about the waveform is not made the
preceding result is invalid. Mathews and David30 have considered a slightly different
assumption. They assumed the waveforms are Fourier integral-band limited. The
conclusion resulting from this assumption is that the signal is perfectly detectable in the
noise independent of the ratio $E/N_0$, as long as it is not zero. In short, d’ is infinite
for any nonzero value of $E/N_0$. Which of these assumptions is the more reasonable or
applicable to a psychoacoustic experiment?
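Before taking up that question, the series-band-limited result can be made concrete. The sketch below estimates the signal energy from samples by a simple Riemann sum, forms $d' = (2E/N_0)^{1/2}$, and converts it to the two-alternative forced-choice percentage under the normal assumption. The function names, the sampling step, and the numerical values of the example signal are hypothetical choices made only for illustration.

```python
from math import pi, sin, sqrt
from statistics import NormalDist


def signal_energy(samples, dt):
    """Riemann-sum estimate of E, the integral of s(t)**2 over the interval."""
    return sum(s * s for s in samples) * dt


def ske_d_prime(energy, noise_density):
    """Detectability for the signal known exactly: (2E / N0) ** 0.5."""
    return sqrt(2.0 * energy / noise_density)


def ske_2afc(energy, noise_density):
    """Two-alternative forced-choice probability, Phi(d' / sqrt(2))."""
    return NormalDist().cdf(ske_d_prime(energy, noise_density) / sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical example: 0.1 second of a 1000-cps sinusoid, 20,000 samples per second.
    dt = 1.0 / 20000.0
    tone = [0.1 * sin(2.0 * pi * 1000.0 * i * dt) for i in range(2000)]
    energy = signal_energy(tone, dt)
    n0 = 0.0005  # hypothetical noise power density, in the same units as the energy
    print(round(energy, 6), round(ske_d_prime(energy, n0), 3), round(ske_2afc(energy, n0), 3))
```

Only the ratio $E/N_0$ enters the result; the particular shape of $s(t)$ does not matter so long as it is specified exactly.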
Neither assumption can be completely justified. In almost all psychoacoustic
experiments, the noise voltage is actually produced by a special tube. The voltage
produced by this tube is amplified and filtered. Such noise is not Fourier series-band
limited, for the noise is clearly not periodic.31 Although a Fourier series might serve
as an excellent approximation to these waveforms in the interval $(0, T)$, it would not
be an exact representation of the waveform. Similarly, an assumption of a Fourier
integral limitation of the bandwidth cannot be correct, because the waveform does not
have a sharp cutoff in the Fourier integral sense. If it did, the waveform would be
analytic. If it were analytic, the ideal observer could sample at one point in time, obtain
all the derivatives at that point, and know the exact form of the wave for all time. Such
a result leads to the conclusion that the ideal observer, by observing one sample of the
waveform at any time can, immediately, in principle, make his decision about all the
waveforms the experimenter has presented in the past and all those he may ever decide
to produce. This approach is therefore of little practical use.

30 M. V. Mathews and E. E. David, J. Acoust. Soc. Am., 1959, 31, 834(A).

31 It is somewhat unfair to imply that Peterson, Birdsall, and Fox assumed the noise was
periodic. Their assumption, strictly speaking, was that each waveform could be represented by
a finite set of numbers. The way they obtained these numbers is through a sampling plan, which
we cannot discuss in detail. It was not a simple Fourier expansion in terms of sine and cosine.
This is a difficult and complex topic; for a discussion of the details in this area see footnote 7;
D. Slepian, “Some comments on the detection of Gaussian signals in Gaussian noise,” Trans.
IRE, PGIT-4, 65 (1958); and W. B. Davenport and W. L. Root, Random signals in noise. New
York: McGraw-Hill, 1958. Precise analysis of the situation where the noise is filtered, i.e.,
where the power spectrum of the noise is a polynomial, can be worked out in principle. The
analysis is complex and exact answers can be obtained only in certain simple cases. One can
show in general, however, that for practical situations the detectability of the signal is finite.
(See Davenport and Root.)

The issue, while obviously only an academic one, has indicated one very
important aspect of the problem. The ideal observer is, like all ideal concepts, only as
good as the assumptions that generate it. Clearly, any such idealization of a practical
situation is based on certain simplifying assumptions. It is always extremely important
to understand what these assumptions are and even more important to realize the
implications of a change in these assumptions. In short, there are many ideal observers,
each generated by certain key assumptions about the essential nature of the detection
task.

For the discussion which follows, we shall use the Peterson, Birdsall, and Fox⁷
approach and assume that the waveforms can be completely represented by a finite
number of measurements. A similar treatment is given by Van Meter and Middleton.³²
As more progress is made with the theory of ideal observers we should be able to state
quite precisely how detection will vary if certain definite restrictions are imposed on the
manner in which the observer operates. Peterson, Birdsall, and Fox have, in fact,
considered several such cases and their results. Each case provides us with a framework
from which we may evaluate and assess the performance of the subject. Such a
comparison provides both qualitative and quantitative guides for further research.³³
There are several areas we might select to illustrate this approach. The one we have
selected was chosen because it is a general topic and because it has been somewhat
slighted in psychoacoustics.


Shape of the Psychophysical Function


The psychophysical function is generally defined as the curve relating the per- 
centage of correct detections of the signal (the ordinate) to some physical measure of the 
signal (the abscissa). If some variant of the constant stimuli method is used, the curve 
rises monotonically from zero to one hundred percent as the signal level is increased. 

Generally, hypotheses about the form of this function arise from assumptions
about the process of discrimination. Often these assumptions are sufficient to allow
one to deduce the form of the psychophysical function to within two or three
parameters which are then determined experimentally. Obviously, it is extremely important
for the model to specify the exact transformation of the physical stimulus which is used
as the abscissa of the psychophysical function; without such specification, the theory
is incomplete.

In psychoacoustics, there has been comparatively little concern with the form of
this function. Most theories of the auditory process have been content with attempting

32 D. Van Meter and D. Middleton, Trans. IRE, 1954, PGIT-4, 119.

33 W. P. Tanner and T. G. Birdsall, J. Acoust. Soc. Am., 1958, 30, 922.

to predict only one parameter of the psychophysical curve, usually the mean or
threshold. As a result, it is nearly impossible to obtain from the literature information
on the actual form of the psychophysical function.

The notable exception to the preceding statement is the neural-quantum
hypothesis.³⁴ The authors of this theory say that it “enables us to predict the form and
the slope of certain psychometric functions.” It can be demonstrated from the model
that the form of the function should be linear and this linear function is specified to
within one parameter. The physical measure is never mentioned in the derivation of the
theory and we find only after the data are presented that sound pressure and frequency
are the appropriate physical measures. The authors remark in their paper that “strictly
speaking, data yielding rectilinear psychometric functions when plotted against sound
pressure do not show absolute rectilinearity when expressed in terms of sound energy,
but calculation shows that the departure from rectilinearity is negligible.” It is certainly
true that pressure, pressure squared, and indeed pressure cubed, are all nearly linear
for small values of pressure, but that is not entirely the point.

It is the location of this function that plays a crucial role in the theory. If the
subject employs a two-quantum criterion then, according to the theory, the
psychophysical function must be zero up to one quantum unit, show a linear increase to one
hundred percent at two quantum units, and maintain this level for more quantum units.
Where the curve breaks from zero percent reports and where it reaches one hundred
percent reports is precisely specified by the theory. In general, if the subject requires
n quanta to produce a positive report, the increasing linear function must extend from
n to n + 1 quantum units. Now clearly, what appears to be a two-quantum subject
(0% at one pressure unit, 100% at two pressure units), when the data are plotted in
pressure units, cannot be interpreted as a two-quantum subject in energy units. In fact,
he cannot be interpreted as an any-number-of-quantum subject. This is true no matter
how small the values of pressure.
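
The argument about break points can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that "energy units" means the square of the value on the pressure abscissa, and the function names and the one-unit quantum size are hypothetical choices made for the example, not part of the neural-quantum theory itself. It asks whether the two break points of the function stand in the ratio of two consecutive whole numbers, as the theory requires on whatever scale is used.

```python
def two_quantum_function(pressure, quantum=1.0):
    """Proportion of positive reports for a two-quantum subject (pressure scale).

    Zero up to one quantum unit, rising linearly to 1.0 at two quantum units.
    """
    return min(1.0, max(0.0, pressure / quantum - 1.0))


def fits_a_quantum_criterion(lower_break, upper_break, max_quanta=1000, tol=1e-9):
    """True if the break points stand in the ratio k : k + 1 for a whole number k."""
    ratio = upper_break / lower_break
    return any(abs(ratio - (k + 1) / k) < tol for k in range(1, max_quanta + 1))


# Break points of the two-quantum function on the pressure scale.
lower_p, upper_p = 1.0, 2.0
# The same two points on a scale proportional to pressure squared (assumed "energy").
lower_e, upper_e = lower_p ** 2, upper_p ** 2

pressure_ok = fits_a_quantum_criterion(lower_p, upper_p)  # ratio 2 = 2/1
energy_ok = fits_a_quantum_criterion(lower_e, upper_e)    # ratio 4 is larger than 2/1

# Halfway up the function on the pressure scale ...
observed_mid = two_quantum_function(1.5)
# ... versus a straight line drawn between the break points on the squared scale.
straight_line_mid = (1.5 ** 2 - lower_e) / (upper_e - lower_e)
```

Because the largest ratio of consecutive whole numbers is 2 to 1, break points in the ratio 4 to 1 fit no quantum criterion, and rescaling the pressure unit does not change that ratio.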

This criticism of the rather post hoc treatment of the physical scale is by no
means limited to the neural-quantum hypothesis. Many hypotheses about the shape
of the psychophysical function, including some formulations of the Gaussian hypothesis,
neglect this rather crucial factor.

Detection theory stands in marked contrast with these theories. Models based
on the ideal observer concept predict the form of the psychophysical function exactly.
The proper physical dimensions are completely specified and there are no free
parameters.

Obviously, one would not be surprised to find human observers somewhat less
than optimum, but hopefully, the shape of the psychophysical function might at least
be parallel to that obtained from the model. Often, however, the obtained
psychophysical function does not parallel that predicted by the model and this discrepancy
deserves some discussion.

Signal uncertainty and ideal detectors.³⁵ In Fig. 6, we have plotted the
percentage of correct detections in a two-alternative forced-choice procedure versus

34 S. S. Stevens, C. T. Morgan, and J. Volkmann, Am. J. Psychol., 1941, 54, 315.

35 The analysis of detection data from the viewpoint of signal uncertainty is very similar
to some ideas expressed by Dr. W. P. Tanner. Although several details of the analysis differ,
the essentials are the same. The author is indebted to Dr. Tanner for many long and lively
conversations on this topic.

FIGURE 6
The theoretical psychophysical functions for the ideal observer detecting 1 of M-orthogonal
signals. The parameter M is the number of possible orthogonal signals. The ideal detector
need only detect the signal, not identify it. The abscissa is ten times the logarithm of signal
energy to noise-power density. The ordinate is the percent correct detection in a two-alternative
forced-choice test. The obtained data are compared with the theoretical function shifted about
10 db to the right.

E/N₀ for a typical subject and a series of mathematical models. The problem in all cases
is simply to detect a sinusoidal signal added to a background of white noise.

We say “typical subject” because the shape of this function is remarkably
invariant over both subjects and a range of physical parameters. For signal durations
of 10 to 1000 msec and signal frequencies from 250 to 4000 cps, there appears to be
no great change in the shape of the function when plotted against the scale shown in
Fig. 6. Naturally, the exact location of the curve depends on the exact physical
parameters of the signal, but except for this constant, which is a simple additive constant in
logarithmic form, the shape is remarkably stable. The striking aspect of this function
is its slope. We notice the slope of the observed function is steeper than most of the
theoretical functions depicted in Fig. 6.

The class of theoretical functions is generated by assuming the detector has
various uncertainties about the exact nature of the signal. Each function is generated
by assuming the detector knows only that the signal will be one of M-orthogonal signals.
If the signal is known exactly (M = 1) there is no uncertainty. For sinusoidal signals,
the nature of the uncertainty might be phase, time of occurrence of the signal, or
signal frequency. The degree of uncertainty is reflected by the parameter M. As this


36 D. M. Green, J. Acoust. Soc. Am., 1959, 31, 836(A).

37 D. M. Green, M. J. McKey, and J. C. R. Licklider, J. Acoust. Soc. Am., 1959, 31, 1446.

38 The details of this model may be found in footnote 7, p. 207. This particular model
was selected because it has been presented in the literature. There are other models which
assume signal uncertainty but which differ in details about the decision rule. The
psychophysical functions produced by these models are similar to those displayed in Fig. 6, although
the value of the parameter (M) would be changed somewhat.

uncertainty increases, the psychophysical function increases in slope. It therefore
appears that there may exist a model with sufficient uncertainty about the signal to
generate a function which is very similar to that displayed by the human observer.
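
For readers who wish to explore curves of this kind numerically, the following is a minimal Monte Carlo sketch of a one-of-M-orthogonal-signals detector in a two-alternative forced-choice task. It is an illustration under stated assumptions, not necessarily the model of footnote 7: each of the M orthogonal signals is assumed to have its own matched filter with unit-variance Gaussian noise at its output, the signal shifts the mean of one filter output by d = sqrt(2E/N₀), all M signals are equally likely, and the observer chooses the interval with the larger likelihood ratio. The function names, the trial count, and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist


def log_likelihood_ratio(outputs, d):
    """Log likelihood ratio of 'one of M signals present' versus 'noise alone'.

    outputs: the M matched-filter outputs, scaled to unit noise variance.
    d: mean shift the signal produces in its own filter, assumed sqrt(2E/N0).
    """
    terms = [d * x - 0.5 * d * d for x in outputs]
    largest = max(terms)
    # Log of the average of exp(terms), computed in a numerically stable way.
    return largest + math.log(sum(math.exp(t - largest) for t in terms) / len(terms))


def percent_correct_2afc(energy_ratio_db, m, trials=20000, seed=0):
    """Monte Carlo estimate of the proportion correct in two-alternative forced choice.

    energy_ratio_db: ten times the logarithm of E/N0, the abscissa of Fig. 6.
    m: number of equally likely orthogonal signals (the parameter M).
    """
    rng = random.Random(seed)
    d = math.sqrt(2.0 * 10.0 ** (energy_ratio_db / 10.0))
    correct = 0
    for _ in range(trials):
        noise_interval = [rng.gauss(0.0, 1.0) for _ in range(m)]
        signal_interval = [rng.gauss(0.0, 1.0) for _ in range(m)]
        signal_interval[rng.randrange(m)] += d  # signal lands in one filter at random
        if log_likelihood_ratio(signal_interval, d) > log_likelihood_ratio(noise_interval, d):
            correct += 1
    return correct / trials


def known_exactly_2afc(energy_ratio_db):
    """Closed form for M = 1 under the same assumptions: Phi(sqrt(E/N0))."""
    return NormalDist().cdf(math.sqrt(10.0 ** (energy_ratio_db / 10.0)))


if __name__ == "__main__":
    for m in (1, 4, 16):
        row = [percent_correct_2afc(db, m, trials=2000) for db in (0.0, 4.0, 8.0)]
        print(m, row)
```

For M = 1 the simulation can be checked against the closed form, which gives some confidence in the larger-M estimates; plotting the estimates against the decibel abscissa shows how the function moves and changes shape as M grows.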

Accepting for the moment the assumption that the extreme slope of the human
observer's psychophysical function is due to some degree of uncertainty about the
signal, we might try to manipulate this slope by various experimental procedures.

Preview technique. One general class of procedures would attempt to reduce 
the uncertainty by supplying the missing information through some form of cueing or 
preview technique. If, for example, the observer is uncertain about the frequency of the 
signal we might attempt to reduce this uncertainty by presenting the signal briefly at a 
high level just prior to the observation interval. Similarly, if the time of occurrence of 
the signal is uncertain we might increase the noise during the observation interval. If 
the noise was increased for all trials, whether or not the signal was presented, it would 
provide no information about the signal's presence but would convey direct informa- 
tion about the signal's starting time and duration. Both of these techniques have been 
utilized with only partial success. While it is impossible to assert that there was no
change (the null hypothesis) the amount of change was very small, although in the
proper direction.³⁹

Another class of procedures which has been utilized to attempt to reduce the
subject's uncertainty about the signal parameters involves changing the detection task
so that some information is directly supplied. The procedures are like the preceding
but actually include the information in the observation interval. For example, to
remove frequency uncertainty, we might add a continuous sine wave to the noise. The
continuous sine wave is adjusted to a level such that it is clearly evident in the noise.
The signal is an increment added to this sine wave and the task is to detect this
increment. The procedure definitely changes the slope of the subject's psychophysical
function: it becomes less steep and the signal is easier to detect.⁴⁰

This procedure of making the signal an increment to a continuous sine wave 
provides good frequency information but does not remove temporal uncertainty. 
Another procedure which minimizes practically all uncertainty is in fact a modification
of a standard procedure used to investigate the j.n.d. for intensity. A two-alternative
forced-choice procedure is employed. Two gated sinusoids occur in noise, one at 
standard level, the other at this level plus an increment. The subject's task is to select 
the interval containing the increment. If the standard signal is adjusted to a power 
level about equal to the noise-power density, the psychophysical function actually 
parallels that expected for the signal-known-exactly case. It is from 3 to 6 db off 
optimum in absolute value, depending on the energy of the standard. (See Fig. 7. 
Note the change in scale between Figs. 6 and 7.) 

Let us, at least tentatively, accept as the conclusion of these last results that the 
shape of the psychophysical function is in fact due primarily to various uncertainties 
about the signal parameter. If this is true, then we still have the problem of explaining 


39 Unpublished work of the author. Also see T. Marill, Ph.D. thesis, Massachusetts
Institute of Technology, 1956, and J. C. R. Licklider and G. H. Flanagan, “On a
methodological problem in audiometry,” unpublished.

40 W. P. Tanner, J. Bigelow, and D. M. Green, unpublished.

41 W. P. Tanner, Electronic Defense Group, University of Michigan, Technical Report
No. 47, 1958.

the lack of success evidenced when the previous techniques were employed. Should
not a preview of the signal, preceding an observation, serve to reduce frequency
uncertainty? The answer might be that such procedures do reduce uncertainty, but not
enough relative to the uncertainty still remaining. From Fig. 6 we note that, as we
introduce signal uncertainty, the slope of the psychophysical function increases very
rapidly for small changes in uncertainty; then, as the uncertainty increases, the slope
approaches some asymptotic value. A change in uncertainty from M = 256 to 64 may
hardly affect the psychophysical function. This fact also probably explains why the
psychophysical functions do not appear to change very much for a variety of signal
parameters, such as signal duration and signal frequency. Undoubtedly, as the signal
duration increases, the uncertainty about the time of occurrence of the signal is
reduced. Due to the large initial uncertainty, this change is too small to be detected in
the data.

Uncertain signal frequency. Still another manner of checking this general
model is to vary the uncertainty of the signal and determine how this affects the subject's
performance. One might, for example, select several different sinusoidal signals and
select one at random as the signal used on a particular trial. The subject is simply asked
to detect a signal, not identify it. Depending on the frequency separation and the
number of signals used, one can directly manipulate signal uncertainty.

FIGURE 7
Observed data in the ΔI versus I experiment and the signal-known-exactly observer (M = 1).
The abscissa and ordinate are the same as in Figure 6, but note the change in scale of the
abscissa. The two curves differ by 6 db at each value of percent correct. The apparent
convergence of the two curves at low values of percent correct is illusory. The insert shows the
level of the noise; the lines show the level of I in power, and the maximum I + ΔI power.

FIGURE 8
The variation of signal-to-noise level for some constant percent correct as a function of M.
This curve is the same information presented in Figure 6 with M as the variable and percent
correct as the parameter.


This in fact was a procedure used in an earlier study by Tanner et al. A small
decrement (1.0 to 1.5 db) in detectability was found if one compared a situation where
a single fixed sinusoid was the signal and a situation where the signal was one of two
sinusoids. Later results show, however, that the decrement did not increase very
much as more components were included in the set of possible signals. This result is
consistent with the theoretical model we have been discussing. Figure 8 shows how,
for a constant detectability, one must change the signal level as uncertainty (M) is
increased. The decrement in signal detectability as a function of signal uncertainty
changes very slowly after M reaches a value of 50 or so. The 1.5 db per octave
decrement, suggested by some of the earlier models to account for the uncertain frequency
data,⁴⁵ is only a reasonable approximation for a rather limited range of M.⁴⁶

While the preceding argument that the shape of the psychophysical function is largely
due to signal uncertainty has some appeal, there still remain some problems with this
interpretation. Another way to attack this problem of signal uncertainty is to use a
signal where little information about the waveform is known, and compare the subject's
performance with the theoretical optimum model in this situation. A specific case
arises where the signal is a sample of noise. The most one can specify about the signal
is the frequency region, starting time, duration, and power. The ideal detector for this
signal can be specified: it simply measures signal energy in the signal band. But the


42 F. A. Veniar, J. Acoust. Soc. Am., 1958, 30, 1020.

43 F. A. Veniar, J. Acoust. Soc. Am., 1958, 30, 1075.

44 C. D. Creelman, Electronic Defense Group, University of Michigan, Technical
Memo. No. 71, 1959.

45 D. M. Green, J. Acoust. Soc. Am., 1958, 30, 904. See also footnotes 16 and 44.

46 J. P. Egan, G. Z. Greenberg, and A. I. Schulman, J. Acoust. Soc. Am., 1959, 31,
1579(A). Egan et al. have investigated how temporal uncertainty affects signal detectability.
In one condition they present a fixed-frequency sinusoidal signal of 0.25 sec duration
somewhere in an 8-sec interval. They did not report the results in detail, but the decrement in
detectability due to temporal uncertainty was small (1 or 2 db).

psychophysical functions obtained with this type of signal are also slightly steeper than
those predicted by the model.⁴⁷ Either partial time uncertainty still remains or signal
uncertainty alone is not a sufficient explanation. The author feels that a better model
would assume that the human observer utilizes some nonlinear detection rule. This
assumption, coupled with the uncertainty explanation, could probably explain most of
the results obtained thus far. The mathematical analysis of such devices is, however,
complex.

Internal noise. Before summarizing, one final point must be considered.
Often it is a temptation to invoke the concept of internal or neural noise when
discussing the discrepancy between an ideal model and the human observer. There are
good reasons for avoiding this temptation. While it would take us too far afield to
cover this point in detail, the following remarks will illustrate the point.

Only if the model is of a particularly simple form can one hope to evaluate the
specific effects of the assumption of internal noise. The signal-known-exactly observer
is of this type. Here one can show how a specific type of internal noise can simply be
treated as adding noise at the input of the detection device. Thus one can evaluate the
psychophysical function and it will be shifted to the right by some number of decibels
(see Fig. 6) due to the internal noise. But, of course, such an assumption can
immediately be rejected since no shift in the psychophysical function can account for the data
displayed in the figure.
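
The size of such a shift is easy to work out in one simple case. The sketch below assumes, purely for illustration, that the internal noise is independent of the external noise and adds to it in power at the input, so that the effective noise-power density becomes N₀ + Nᵢ; the function name and the ratio argument are hypothetical. Under that assumption every point of the signal-known-exactly function moves to the right by the same number of decibels, which is why the shape of the function on the decibel scale is unchanged.

```python
import math


def internal_noise_shift_db(internal_to_external_ratio):
    """Rightward shift, in db, of the psychophysical function.

    internal_to_external_ratio: N_i / N_0, the assumed internal noise-power
    density relative to the external noise-power density.
    The signal energy needed for a given E / (N_0 + N_i) rises by the factor
    (1 + N_i / N_0), which is a constant offset on a decibel abscissa.
    """
    return 10.0 * math.log10(1.0 + internal_to_external_ratio)


# Internal noise equal to the external noise costs about 3 db;
# a 10-db shift would require internal noise nine times the external noise.
shift_equal = internal_noise_shift_db(1.0)
shift_nine = internal_noise_shift_db(9.0)
```

A pure shift of this kind moves the curve without steepening it, which is the reason the assumption cannot by itself account for data whose slope differs from that of the model.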

With more complicated models, it is usually difficult to say exactly what internal
noise will do. While it will obviously lower discrimination, the specific effects of the
assumption are often impossible to evaluate. Unless these specific effects can be
evaluated, the assumption simply rephrases the original problem of the discrepancy.

I am not suggesting that the human observer is perfect in any sense, nor
attempting to minimize the importance of the concept of internal noise. What I am
emphasizing is that the concept must be used with great care. If the concept is to have any
importance it must be made specific. This implies that we have to (1) state exactly
what this noise is, i.e., that we have to characterize it mathematically, (2) specify in
what way it interacts with the detection or discrimination process, and (3) evaluate
specifically what effect it will have on performance. Unless these steps can be carried
out the ad hoc nature of the assumption vitiates its usefulness.


Summary and Conclusion 


The main emphasis in this paper has been to explain detection theory and to
illustrate how such a theory has been applied to certain areas of psychoacoustics.
This method of analysis is simply one of many that are currently being used in an attempt
to understand the process of hearing.

Two main aspects of this approach have been distinguished. The first, decision
theory, emphasizes that the subject's criterion as well as the physical properties of the
stimulus play a major role in determining the subject's responses. The theory indicates
both the class of variables which determines the level of the criterion, and, more
importantly, suggests an analytic technique for removing this source of variation. This
technique leaves a relatively pure measure of the detectability of the signal. The
invariance of this measure over several psychophysical procedures has already been
demonstrated.


47 D. M. Green, J. Acoust. Soc. Am., 1960, 32, 121.

FIGURE 9
The normalized expected value as a function of changes in criterion. This is a theoretical 
curve based on the data presented in Figures 2 and 3. The appendix lists the assumptions used 
to construct the curve. 


The second aspect, the theory of ideal observers, has also been discussed in some
detail. The usefulness of such an analysis was illustrated by considering the form of the
psychophysical function. No ideal observer provides a complete or comprehensive
model even for the rather limited areas of psychoacoustics that we have discussed in this
paper. The model provides a source of hypotheses and a standard against which
experimental results can be evaluated. It is too early to attempt any complete evaluation of
this approach. The mathematical models are relatively new and the application of these
models to a sensory process began with Tanner and Swets⁴⁸ only about five years ago.
There remain many problems to be solved both of a mathematical and experimental
nature. As more progress is made in both areas, the theory should become more specific
and concrete, then perhaps it will be able to interact more directly with the research
from several other areas in psychoacoustics.
Appendix A

The inherent difficulty of comparing the optimum criterion value and that
employed by the subject is the shape of the expected-value function. Let us investigate
in detail a typical situation. We have assumed that the distribution on likelihood ratio
is normal under both hypotheses, that the mean separation is one sigma unit, and
that the values and costs of the various decision alternatives are all the same. From
these assumptions we have constructed Fig. 9. This figure shows how the expected
value varies with changes in a priori probability of signal P(SN) and false-alarm rate
P_N(A). We see immediately that for extreme values of a priori probability, e.g., P(SN) =
0.10, the difference between optimum expected-value behavior [P_N(A) = 0.004] and

48 W. P. Tanner and J. A. Swets, Psychol. Rev., 1954, 61, 401.

a pure strategy [P_N(A) = 0.000] is less than 3%. In fact, the curves in the figure were
somewhat exaggerated to allow one to see the location of the maximum. Since most
subjects are instructed to avoid pure strategies in psychoacoustic experiments, this
tends to force the subject to adopt more moderate values of P_N(A) for extreme
conditions.

On the other hand, if more moderate a priori probabilities are employed in the
experiment [e.g., P(SN) = 0.50], we see that any value of P_N(A) within a range from
0.15 to 0.50 will achieve at least 90% of the maximum expected payoff.

Thus any attempt to investigate, in any more than a correlational sense, the
correspondence between obtained and optimum criteria appears extremely difficult.
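
The flatness of the expected-value function can be explored with a short calculation. The sketch below is an illustration under assumptions of my own choosing rather than a reconstruction of Fig. 9: the decision variable is taken to be normal with unit variance under both hypotheses and a mean separation of one sigma unit, a correct decision is scored +1 and an incorrect decision -1 (one way of making the values and costs "all the same"), and the function names are hypothetical. The normalization used for the figure may differ.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def hit_rate(false_alarm_rate, d_prime=1.0):
    """Hit probability that goes with a given false-alarm rate P_N(A)."""
    if false_alarm_rate <= 0.0:
        return 0.0  # pure strategy: never report a signal
    if false_alarm_rate >= 1.0:
        return 1.0  # pure strategy: always report a signal
    criterion = STD_NORMAL.inv_cdf(1.0 - false_alarm_rate)
    return 1.0 - STD_NORMAL.cdf(criterion - d_prime)


def expected_value(false_alarm_rate, p_signal, d_prime=1.0):
    """Expected payoff per trial with +1 for a correct and -1 for an incorrect decision."""
    p_correct = (p_signal * hit_rate(false_alarm_rate, d_prime)
                 + (1.0 - p_signal) * (1.0 - false_alarm_rate))
    return 2.0 * p_correct - 1.0


def optimum_false_alarm_rate(p_signal, d_prime=1.0):
    """False-alarm rate at the likelihood-ratio criterion beta = P(N) / P(SN)."""
    beta = (1.0 - p_signal) / p_signal
    criterion = math.log(beta) / d_prime + d_prime / 2.0
    return 1.0 - STD_NORMAL.cdf(criterion)


if __name__ == "__main__":
    for p_signal in (0.10, 0.50):
        best_rate = optimum_false_alarm_rate(p_signal)
        print(p_signal, best_rate, expected_value(best_rate, p_signal))
```

With P(SN) = 0.10 the optimum false-alarm rate computed this way falls between 0.003 and 0.004, close to the value quoted above, and the expected value there differs from that of the pure strategy by well under 3%.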


Received June 23, 1960. 


SOME COMMENTS AND A CORRECTION OF
“PSYCHOACOUSTICS AND DETECTION THEORY”*


DAVID M. GREEN


DEPARTMENT OF ECONOMICS AND RESEARCH LABORATORY OF ELECTRONICS, 
MASSACHUSETTS INSTITUTE OF TECHNOLOGY, CAMBRIDGE, MASSACHUSETTS 


Dr. S. S. Stevens has very kindly pointed out two items in my paper,
“Psychoacoustics and Detection Theory,” that require further comment in order to avoid
misunderstanding.

I called the function relating the percentage of correct detection responses to the
physical intensity of the stimulus the psychophysical function. It is true that this
function is more often called the psychometric function, a term probably introduced
by Urban in 1908.²

Originally Fechner added up successive just-noticeable-differences (jnd's) to
determine the relation between the magnitude of sensation and the physical intensity
of the stimulus. The resulting relation is commonly called the psychophysical function.
Since Fechner’s time many other techniques for determining this relation have been
devised and the results are also called psychophysical functions (e.g., Stevens’ power
laws). The newer methods do not involve determining jnd’s and are not obtained by
using any simple variant of the classical methods of psychophysics. We are therefore
faced with the anomaly that psychometric functions are obtained by using
psychophysical methods and psychophysical functions are now determined by other, different
techniques.
Personally I find the designation used in vision (frequency-of-seeing curve)
even more distasteful than the term psychometric function. Some change in
terminology would be most welcome. I am open for suggestions.

The second item is more crucial and concerns my remarks about the
neural-quantum theory. I asserted that data that appear to indicate a two-quantum observer
when plotted against pressure units cannot be interpreted as any kind of quantum
observer when plotted against energy units. There is, however, a very straightforward
interpretation of the scales of pressure and energy that makes this assertion incorrect.
Unfortunately, this interpretation had never occurred to me, and I thereby did injustice
to the authors of the neural-quantum theory. Let me explain this interpretation and
the scale of pressure and energy units that I had in mind when I made my remarks.

In the neural-quantum procedure we have a continuous sinusoidal stimulus
(call it the standard). At specific times we increase briefly the amplitude of this sinusoid
and the observer's task is to detect these increments. If we measure the pressure of the
standard, call it p, and measure the pressure of the standard plus the increment, call it

From J. Acoust. Soc. Amer., 1961, 33, 965. Reprinted with permission.

* The preparation of this letter was supported by the U.S. Army Signal Corps, the Air
Force (Operational Applications Office and Office of Scientific Research), and the Office of
Naval Research. This is Technical Note No. ESD TN 61-56.

1 D. M. Green, J. Acoust. Soc. Am., 1960, 32, 1189.

2 F. M. Urban, The application of statistical methods to the problems of psychophysics,
Philadelphia: Psychological Clinic Press, 1908, p. 107.

3 S. S. Stevens, Psychol. Rev., 1957, 64, 153.

p + Δp, then by subtracting the former from the latter we obtain on a pressure scale
values of Δp. We may call this quantity Δp the increment of pressure.

Similarly, if we measure the power of the standard, a quantity proportional to
p², and the power of the standard plus the increment, a quantity proportional to
(p + Δp)², we might subtract the former from the latter, and (since the constants of
proportionality are the same) obtain the quantity (p² + 2Δp·p + Δp² − p²) =
(2Δp·p + Δp²). The latter quantity is also proportional to energy, since the increment
is of constant duration, and we may call this quantity the increment of energy. The
important result is that these two quantities, the increment in pressure and the increment
in energy, are nearly linear for values of Δp much less than p. If some data are exactly
consistent with the predictions of the neural quantum theory on one scale, they would
very nearly be consistent on the other scale.

When I made my remarks, I had in mind data plotted on a scale of signal
pressure or signal energy. By signal I mean the waveform added to the standard that
the observers are asked to detect. In this terminology, the pressure of the signal is
proportional to Δp and the energy of signal is proportional to that quantity squared,
Δp². Only data plotted on a scale of signal pressure as I have now defined it are in
agreement with the predictions of neural quantum theory.
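
The difference between the two energy measures can be seen in a small numerical example. The sketch below simply evaluates the two quantities defined in this letter, the increment of energy (p + Δp)² − p² and the signal energy Δp², with all constants of proportionality set to one; the function names and the particular pressure values are hypothetical choices for illustration.

```python
def increment_of_energy(p, dp):
    """Energy of standard plus increment, less energy of the standard: 2*p*dp + dp**2."""
    return (p + dp) ** 2 - p ** 2


def signal_energy(dp):
    """Energy of the added waveform taken by itself: dp**2."""
    return dp ** 2


p = 1.0                      # pressure of the standard (arbitrary units)
small, doubled = 0.01, 0.02  # two increments of pressure, both much less than p

# Doubling the increment of pressure very nearly doubles the increment of energy ...
ratio_increment = increment_of_energy(p, doubled) / increment_of_energy(p, small)
# ... but quadruples the signal energy.
ratio_signal = signal_energy(doubled) / signal_energy(small)

# Relative departure of the increment of energy from the straight line 2*p*dp.
departure = increment_of_energy(p, small) / (2.0 * p * small) - 1.0  # equals dp / (2*p)
```

For increments much smaller than the standard, the increment of energy is therefore very nearly a linear function of the increment of pressure, while the signal energy is not.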

Part of the reason for my oversight undoubtedly arose from the fact that this
measure of signal energy Δp² is the quantity I used in presenting some of the data
reported later in my paper. There is, however, no inherent reason for using my
particular measure of the stimulus and I should have made my reference clear.

In some cases the two different scales of energy obtained from the pressure scale
would be exactly the same. This would happen if the standard and signal are
incoherent; that is, if the middle term in the square of (Δp + p) is zero. An example of
this would be an increment in white noise. In the case at hand, this is not true and the
quantity that I have called increment in energy and the quantity that I called signal
energy are quite different.

The general point I was trying to make is that the neural-quantum theory does
not specify in advance how the physical stimulus should be measured. It was my
position that it is important for a theory of psychophysics to specify how the physical
scale is related to the expected psychological results. This position is apparently not
widely endorsed. I am particularly impressed with the number of theories that suggest
that the psychometric function is Gaussian, log-Gaussian, Poisson, rectilinear, or
logistic, but cannot specify in advance what particular transformation of the physical
scale will yield these results. It is not hard to envision different circumstances in which
all these assertions are true at least in the sense that deviations are within the range of
experimental error. Somehow there never seems to be any resolution to these different
findings.

One can, of course, simply ignore all this and go on measuring only one arbitrary
parameter of the psychometric function such as the “threshold” value. While this
position obviously has the merit of convenience, it would also appear important to
demonstrate how all of these different results might come about from one single general
theory. To accomplish the latter task one must have a theory which carefully specifies
the physical part of the psychophysical theory.


Received April 14, 1961. 


ON THE POSSIBLE PSYCHOPHYSICAL LAWS¹


R. DUNCAN LUCE 


Harvard University 


This paper is concerned with the 
century-old effort to determine the 
functional relations that hold between 
subjective continua and the physical 
continua that are presumed to underlie 
them. The first, and easily the most 
influential, attempt to specify the pos- 
sible relations was made by Fechner. 
It rests upon empirical knowledge of 
how discrimination varies with inten- 
sity along the physical continuum and 
upon the assumption that jnd’s are 
subjectively equal throughout the con- 
tinuum. When, for example, discrimi- 
nation is proportional to intensity 
(Weber's law), Fechner claimed that 
the equal-jnd assumption leads to a 
logarithmic relation (Fechner’s law). 

This idea has always been subject 
to controversy, but recent attacks upon 
it have been particularly severe. At 
the theoretical level, Luce and Edwards 


1 This work has been supported in part 
by Grant M-2293 from the National Institute 
of Mental Health and in part by Grant 
NSF-G 5544 from the National Science 
Foundation. 

Ward Edwards, E. H. Galanter, Frederick 
Mosteller, Frank Restle, S. S. Stevens, and 
Warren Torgerson have kindly given me 
their thoughtful comments on drafts of this 
paper, many of which are incorporated into 
this version. I am particularly indebted to 
S. S. Stevens for his very detailed substan- 
tive and stylistic criticisms of the last two 
drafts. 


(1958) have pointed out that Fechner's 
mathematical reasoning was not sound. 
Among other things, his assumption is 
not sufficient to generate an interval 
scale. By recasting his problem some- 
what—essentially by replacing the 
equal-jnd assumption with the some- 
what stronger condition that “equally 
often noticed differences are equal, ex- 
cept when always or never noticed” — 
they were able to show that an interval 
scale results, and to present a mathe- 
matical expression for it. Their work 
has no practical import when Weber's 
law, or its linear generalization $\Delta x = ax + b$, 
is true, because the logarithm 
is still the solution, but their jnd 
scale differs from Fechner's integral 
when Weber’s law is replaced by some 
other function relating stimulus jnd’s 
to intensity. 

At the empirical level, Stevens 
(1956, 1957) has argued that jnd’s are 
unequal in subjective size on intensive, 
or what he calls prothetic, continua—a 
contention supported by considerable 
data—and that the relation between the 
subjective and physical continua is the 
power function $\alpha x^{\beta}$, not the logarithm. 
Using such “direct” methods as mag- 
nitude estimation and ratio production, 
he and others (Stevens: 1956, 1957; 
Stevens & Galanter, 1957) have accu- 
mulated considerable evidence to but- 


This article appeared in Psychol. Rev., 1959, 66, 81-95. Reprinted with permission. 

ter involves relations among two or 
more variables. In practice, substan- 
tive theories are usually stated in terms 
of functional relations among the scales 
that result from the several measure- 
ment theories for the variables involved. 

For a number of purposes, the scale 
type is much more crucial than the 
details of the measurement theory from 
which the scale is derived. For exam- 
ple, much attention has been paid to 
the limitations that the scale type 
places upon the statistics one may sensibly 
employ. If the interpretation of 
a particular statistic or statistical test 
is altered when admissible scale trans- 
formations are applied, then our sub- 
stantive conclusions will depend upon 
which arbitrary representation of the 
scale we have used in making our cal- 
culations. Most scientists, when they 
understand the problem, feel that they 
should shun such statistics and rely 
only upon those that exhibit the appropriate 
invariances for the scale type 
at hand. Both the geometric and arith- 
metic means are legitimate in this sense 
for ratio scales (unit arbitrary), only 
the latter is legitimate for interval 
scales (unit and zero arbitrary), and 
neither for ordinal scales. For fuller 
discussions, see Stevens: 1946, 1951, 
1955; for a somewhat less strict interpretation 
of the conclusions, see Mosteller, 
1958. 

A second place where the transformation 
group imposes limitations is in 
the construction of substantive theories. 
These limitations seem to have received 
far less attention than the statistical 
questions, even though they are un- 
doubtedly more fundamental. The re- 
mainder of the paper will attempt to 
formulate the relation between scale 
types and functional laws, and to answer 
the question what psychophysical 
laws are possible. As already pointed 
out, these issues have scientific relevance 
beyond psychophysics. 


A PRINCIPLE OF THEORY 
CONSTRUCTION 


In physics one finds at least two 
classes of basic assumptions: specific 
empirical laws, such as the universal 
law of gravitation or Ohm's law, and 
a priori principles of theory construc- 
tion, such as the requirement that the 
laws of mechanics should be invariant 
under uniform translations and rota- 
tions of the coordinate system. Other 
laws, such as the conservation of en- 
ergy, seem to have changed from the 
empirical to the a priori category dur- 
ing the development of physics. In 
psychology more stress has been put 
on the discovery of empirical laws than 
on the formulation of guiding princi- 
ples, and the search for empirical rela- 
tions tends to be pursued without the 
benefit of explicit statements about 
what is and is not an acceptable theory.² 
Since such principles have been 
used effectively in physics to limit the 
possible physical laws, one wonders 
whether something similar may not be 
possible in psychology. 

Without such principles, practically 
any relation is a priori possible, and 
the correct one is difficult to pin down 
by empirical means because of the ever 
present errors of observation. The 
error problem is particularly acute in 
the behavioral sciences. On the other 
hand, if a priori consideration about 
what constitutes an acceptable theory 
limits us to some rather small set of 
possible laws, then fairly crude observations 
may sometimes suffice to decide 
which law actually obtains. 


2 Two attempts to introduce and use such 
statements in behavioral problems are the 
combining of classes condition in stochastic 
learning theory (Bush, Mosteller, & Thomp- 
son, 1954) and some work on the form of 
the utility function for money which is based 
upon the demand that certain game theory 
solutions should remain unchanged when a 
constant sum of money is added to all the 
payoffs (Kemeny & Thompson, 1957). In 
neither case do the conditions seem particu- 
larly compelling. 

The principle to be suggested appears 
to be a generalization of one used in 
physics. It may be stated as follows. 


A substantive theory relating two 
or more variables and the meas- 
urement theories for these varia- 
bles should be such that: 

1. (Consistency of substantive 
and measurement theories) Admis- 
sible transformations of one or 
more of the independent variables 
shall lead, via the substantive the- 
ory, only to admissible transfor- 
mations of the dependent variables. 

2. (Invariance of the substan- 
tive theory) Except for the nu- 
merical values of parameters that 
reflect the effect on the dependent 
variables of admissible transfor- 
mations of the independent vari- 
ables, the mathematical structure 
of the substantive theory shall be 
independent of admissible trans- 
formations of the independent 
variables. 


In this principle, and in what fol- 
lows, the terms independent and de- 
pendent variables are used only to 
distinguish the variables to which arbi- 
trary, admissible transformations are 
imposed from those for which the 
transformations are determined by the 
substantive theory. As will be seen, 
in some cases the labeling is truly arbi- 
trary in the sense that the substantive 
theory can be written so that any vari- 
able appears either in the dependent 
or independent role, but in other cases 
there is a true asymmetry in the sense 
that some variables must be dependent 
and others independent if any substan- 
tive theory relates them at all. 

One can hardly question the con- 
sistency part of the principle. If an 
admissible transformation of an inde- 
pendent variable leads to an inadmissi- 


ble transformation of a dependent vari- 
able, then one is simply saying that the 
strictures imposed by the measurement 
theories are incompatible with those 
imposed by the substantive theory. 
Such a logical inconsistency must, I 
think, be interpreted as meaning that 
something is amiss in the total theo- 
retical structure. 

The invariance part is more subtle 
and controversial. It asserts that we 
should be able to state the substantive 
laws of the field without reference to 
the particular scales that are used to 
measure the variables. For example, 
we want to be able to say that Ohm’s 
law states that voltage is proportional 
to the product of resistance and current 
without specifying the units that are 
used to measure voltage, resistance, or 
current. Put another way, we do not 
want to have one law when one set of 
units is used and another when a differ- 
ent set of units is used. Although this 
seems plausible, there are examples 
from physics that can be viewed as a 
particular sort of violation of Part 2; 
however, let us postpone the discussion 
of these until some consequences of the 
principle as stated have been derived. 

The meaning of the principle may 
be clarified by examples that violate it. 
Suppose it is claimed that two ratio 
scales are related by a logarithmic law. 
An admissible transformation of the 
independent variable x is multiplication 
by a positive constant k, i.e., a change 
of unit. However, the fact that 
$\log kx = \log k + \log x$ means that an in- 
admissible transformation, namely, a 
change of zero, is effected on the de- 
pendent variable. Hence, the loga- 
rithm fails to meet the consistency 
requirement. Next, consider an exponential 
law; then the transformation 
leads to $e^{kx} = (e^{x})^{k}$. This can be 
viewed either as a violation of con- 
sistency or of invariance. If the law 
is exponential, then the dependent variable 
is raised to a power, which is 
inconsistent with its being a ratio scale. 
Alternatively, the dependent variable 
may be taken to be a ratio scale, but 
then the law is not invariant because 
it is an exponential raised to a power 
that depends upon the unit of the inde- 
pendent variable. 
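
The two violations described above can be checked numerically. The following Python sketch is an illustration added here, not part of the original argument; the helper names `ratio_after_unit_change` and `is_constant`, the sample values, and the particular power law are all assumptions chosen for the demonstration. It changes the unit of x by a factor k and asks whether the induced change in u is a pure rescaling, which is the only transformation admissible for a ratio-scale dependent variable.

```python
import math

def ratio_after_unit_change(law, k, xs):
    """Return u(k*x) / u(x) for each x; identical values mean a pure change of unit."""
    return [law(k * x) / law(x) for x in xs]

def is_constant(values, tol=1e-9):
    """True when all values agree up to a small relative tolerance."""
    return max(values) - min(values) <= tol * max(abs(v) for v in values)

xs = [2.0, 3.0, 5.0, 10.0]   # arbitrary sample values (chosen > 1 so that log x != 0)
k = 4.0                      # arbitrary change of unit for the independent variable

power_law = lambda x: 3.0 * x ** 0.5   # hypothetical power law
log_law = math.log                     # logarithmic law
exp_law = math.exp                     # exponential law

# Only the power law turns a change of unit in x into a change of unit in u.
power_ok = is_constant(ratio_after_unit_change(power_law, k, xs))
log_ok = is_constant(ratio_after_unit_change(log_law, k, xs))
exp_ok = is_constant(ratio_after_unit_change(exp_law, k, xs))

# For the logarithm the induced change is instead a constant shift of the zero, log k.
log_shifts = [log_law(k * x) - log_law(x) for x in xs]
log_shift_is_constant = is_constant(log_shifts)

print(power_ok, log_ok, exp_ok, log_shift_is_constant)
```

Under these assumptions the logarithm fails because it produces a shift of zero rather than a change of unit, which is admissible for an interval scale but not for a ratio scale.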


AN APPLICATION OF THE PRINCIPLE 


Most of the physical measures en- 
tering into psychophysics are idealized 
in physical theories in such a way that 
they form either ratio or interval 
scales. Mass, length, pressure, and 
time durations are measured on ratio 
scales, and physical time (not time 
durations), ordinary temperature, and 
entropy are measured on interval 
scales. Of course, differences and de- 
rivatives of interval scale values con- 
stitute ratio scales. 

Although most psychological scales 
in current use can at best be con- 
sidered to be ordinal, those who have 
worked on psychological measurement 
theories have attempted to arrive at 
scales that are either ratio or interval, 
preferably the former. Examples: 
the equally often noticed difference 
assumption and the closely related 
Case V of Thurstone’s “law of comparative 
judgment” lead to interval 
scales; Stevens has argued that mag- 
nitude estimation methods result in 
ratio scales (but no measurement theory 
has been offered in support of this 
plausible belief); and I have given suf- 
ficient conditions to derive a ratio 
scale from discrimination data. Our 
question here, however, is not how 
well psychologists have succeeded in 
perfecting scales of one type or another, 
but what a knowledge of scale 
types can tell us about the relations 
among scales. 

In addition to these two common 
scale types, there is some interest 
in what have been called logarithmic 
interval scales (Stevens, 1957). In this 
case the admissible transformations 
are multiplications by positive con- 
stants and raising to positive powers, 
i.e., $kx^{c}$, where k > 0 and c > 0. The 
name applied to this scale type re- 
flects the fact that log x is an interval 
scale, since the transformed scale goes 
into $c \log x + \log k$. We will consider 
all combinations of ratio, interval, and 
logarithmic interval scales. 

Because this topic is more general 
than psychophysics, I shall refer to 
the variables as independent and de- 
pendent rather than physical and psychological. 
Both variables will be 
assumed to form numerical continua 
having more than one point. Let 
$x \geq 0$ denote a typical value of the 
independent variable and $u(x) \geq 0$ 
the corresponding value of the de- 
pendent variable, where u is the un- 
known functional law relating them. 
Suppose, first, that both variables 
form ratio scales. If the unit of the 
independent variable is changed by 
multiplying all values by a positive 
constant k, then according to the 
principle stated above only an ad- 
missible transformation of the de- 
pendent variable, namely multiplica- 
tion by a positive constant, should re- 
sult and the form of the functional law 
should be unaffected. That is to say, 
the changed unit of the dependent 
variable may depend upon k, but it 
shall not depend upon x, so we denote 
it by K(k). Casting this into mathe- 
matical terms, we obtain the func- 
tional equation 


\[ u(kx) = K(k)\,u(x) \]


where k > 0 and K(k) > 0. 
Functional equations for the other 
cases are arrived at in a similar man- 
ner. They are summarized in Table 1. 
The question is: What do these nine 
functional equations, each of which 
embodies the principle, imply about 

TABLE 1 

THE FUNCTIONAL EQUATIONS FOR THE LAWS SATISFYING THE PRINCIPLE OF THEORY CONSTRUCTION 

     Scale Types 
Eq.  Independent   Dependent     Functional Equation                       Comments 
No.  Variable      Variable 

1    ratio         ratio         $u(kx) = K(k)\,u(x)$                      k > 0, K(k) > 0 
2    ratio         interval      $u(kx) = K(k)\,u(x) + C(k)$               k > 0, K(k) > 0 
3    ratio         log interval  $u(kx) = K(k)\,u(x)^{C(k)}$               k > 0, K(k) > 0, C(k) > 0 
4    interval      ratio         $u(kx + c) = K(k,c)\,u(x)$                k > 0, K(k,c) > 0 
5    interval      interval      $u(kx + c) = K(k,c)\,u(x) + C(k,c)$       k > 0, K(k,c) > 0 
6    interval      log interval  $u(kx + c) = K(k,c)\,u(x)^{C(k,c)}$       k > 0, K(k,c) > 0, C(k,c) > 0 
7    log interval  ratio         $u(kx^{c}) = K(k,c)\,u(x)$                k > 0, c > 0, K(k,c) > 0 
8    log interval  interval      $u(kx^{c}) = K(k,c)\,u(x) + C(k,c)$       k > 0, c > 0, K(k,c) > 0 
9    log interval  log interval  $u(kx^{c}) = K(k,c)\,u(x)^{C(k,c)}$       k > 0, c > 0, K(k,c) > 0, C(k,c) > 0 


the form of u? We shall limit our 
consideration to theories where u is 
a continuous, nonconstant function 
of x. 

Theorem 1. If the independent and 
dependent continua are both ratio scales, 
then $u(x) = \alpha x^{\beta}$, where $\beta$ is independent 
of the units of both variables.³ 

Proof. Set x = 1 in Equation 1, then 
u(k) = K(k)u(1). Because u is nonconstant 
we may choose k so that 
u(k) > 0, and because K(k) > 0, it 
follows that u(1) > 0, so K(k) = u(k)/u(1). 
Thus, Equation 1 becomes 
u(kx) = u(k)u(x)/u(1). Let $v = \log[u/u(1)]$, 
then 

\[
\begin{aligned}
v(kx) &= \log\,[u(kx)/u(1)] \\
      &= \log \frac{u(k)\,u(x)}{u(1)\,u(1)} \\
      &= \log\,[u(k)/u(1)] + \log\,[u(x)/u(1)] \\
      &= v(k) + v(x)
\end{aligned}
\]

Since u is continuous, so is v, and it is 
well known that the only continuous 
solutions to the last functional equa- 
tion are of the form 

\[ v(x) = \beta \log x = \log x^{\beta} \]

Thus, 

\[ u(x) = u(1)\,e^{v(x)} = \alpha x^{\beta} \]

where $\alpha = u(1)$. 

We observe that since 

\[ u(kx) = \alpha k^{\beta} x^{\beta} = \alpha' x^{\beta} \]

$\beta$ is independent of the unit of x, and 
it is clearly independent of the unit 
of u. 


3 In this and in the following theorems, 
the statement can be made more general if 
x is replaced by $x + \gamma$, where $\gamma$ is a constant 
independent of x but having the same unit as 
x. The effect of this is to place the zero of 
u at some point different from the zero of x. 
In psychophysics the constant $\gamma$ may be re- 
garded as the threshold. The presence of 
such a constant means, of course, that a plot 
of log u vs. log x will not in general be a 
straight line. If, however, the independent 
variable is measured in terms of deviations 
from the threshold, the plot may become 
straight. Such nonlinear plots have been 
observed, and in at least some instances the 
degree of curvature seems to be correlated 
with the magnitude of the threshold. Fur- 
ther empirical work is needed to see whether 
this is a correct explanation of the curvature. 
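
Theorem 1's claim that the exponent is independent of the units of both variables, while the coefficient is not, can be illustrated with a short numerical sketch. The code below is an added illustration in Python; the parameter values, the unit-change factors `k` and `m`, and the helper `loglog_slope` are hypothetical choices, not quantities taken from the paper.

```python
import math

def loglog_slope(law, x1, x2):
    """Slope of log u against log x between two points; for a power law this is the exponent."""
    return (math.log(law(x2)) - math.log(law(x1))) / (math.log(x2) - math.log(x1))

alpha, beta = 2.5, 0.3            # hypothetical parameter values
u = lambda x: alpha * x ** beta   # the law expressed in the original units

k, m = 1000.0, 0.01               # hypothetical unit changes: x' = k*x and u' = m*u
# The same law expressed in the new units: u'(x') = m * u(x'/k).
u_new = lambda x_new: m * u(x_new / k)

slope_old = loglog_slope(u, 1.0, 7.0)
slope_new = loglog_slope(u_new, 1.0, 7.0)
alpha_new = u_new(1.0)            # the coefficient is the value of the law at x' = 1

print(slope_old, slope_new, alpha, alpha_new)
```

In this sketch the log-log slope is the same before and after the change of units, whereas the coefficient absorbs both unit changes.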


Theorem 2. If the independent con- 
tinuum is a ratio scale and the depend- 
ent continuum an interval scale, then 
either $u(x) = \alpha \log x + \beta$, where $\alpha$ is 
independent of the unit of the inde- 
pendent variable, or $u(x) = \alpha x^{\beta} + \delta$, 
where $\beta$ is independent of the units of 
both variables and $\delta$ is independent of 
the unit of the independent variable. 


Proof. In solving Equation 2, there 
are two possibilities to consider. 

1. If $K(k) \equiv 1$, then define $v = e^{u}$. 
Equation 2 becomes v(kx) = D(k)v(x), 
where $D(k) = e^{C(k)} > 0$ and v is con- 
tinuous, positive, and nonconstant be- 
cause u is. By Theorem 1, $v(x) = \delta x^{\alpha}$, 
where $\alpha$ is independent of the unit of 
x and where $\delta > 0$ because, by defini- 
tion, v > 0. Taking logarithms, 
$u(x) = \alpha \log x + \beta$, where $\beta = \log \delta$. 

2. If $K(k) \not\equiv 1$, then let u and u* 
be two different solutions to the prob- 
lem, and define w = u* − u. It fol- 
lows immediately from Equation 2 
that w must satisfy the functional 
equation w(kx) = K(k)w(x). Since 
both u and u* are continuous, so is w; 
however, it may be a constant. Since 
$K(k) \not\equiv 1$, it is clear that the only 
constant solution is w = 0, and this is 
impossible since u and u* were chosen 
to be different. Thus, by Theorem 1, 
$w(x) = \alpha x^{\beta}$. Substituting this into the 
functional equation for w, it follows 
that $K(k) = k^{\beta}$. Then setting x = 0 
in Equation 2, we obtain 
$C(k) = u(0)(1 - k^{\beta})$. We now observe that 
$u(x) = \alpha x^{\beta} + \delta$, where $\delta = u(0)$, is a 
solution to Equation 2: 

\[
\begin{aligned}
u(kx) &= \alpha k^{\beta} x^{\beta} + \delta \\
      &= \alpha k^{\beta} x^{\beta} + u(0)\,k^{\beta} + u(0) - u(0)\,k^{\beta} \\
      &= k^{\beta} u(x) + u(0)\,(1 - k^{\beta}) \\
      &= K(k)\,u(x) + C(k)
\end{aligned}
\]

Any other solution is of the same form 
because 

\[
\begin{aligned}
u^{*}(x) &= u(x) + w(x) \\
         &= \alpha x^{\beta} + \delta + \alpha' x^{\beta} \\
         &= (\alpha + \alpha')\,x^{\beta} + \delta
\end{aligned}
\]

It is easy to see that $\delta$ is independent 
of the unit of x and $\beta$ is independent 
of both units. 

A much simpler proof of this theo- 
rem can be given if we assume that u 
is differentiable in addition to being 
continuous. Since the derivative of an 
interval scale is a ratio scale, it follows 


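immediately that du/dx satisfies 
Equation 1 and so, by Theorem 1, 
$du/dx = \alpha x^{\beta}$. Integrating, we get 

\[
u(x) =
\begin{cases}
\dfrac{\alpha}{\beta + 1}\, x^{\beta + 1} + \delta & \text{if } \beta \neq -1 \\[1ex]
\alpha \log x + \delta & \text{if } \beta = -1
\end{cases}
\]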
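
As a numerical illustration of Theorem 2, the following Python sketch checks that both families of solutions satisfy Equation 2 on a small grid, with multipliers K(k) and shifts C(k) that do not depend on x. It is an added sketch rather than part of the proof; the function names, grid values, and parameter values are assumptions made for the example.

```python
import math

def satisfies_equation_2(u, K, C, ks, xs, tol=1e-9):
    """Check u(k*x) == K(k)*u(x) + C(k) for every k and x on the grid."""
    return all(
        abs(u(k * x) - (K(k) * u(x) + C(k))) <= tol * (1 + abs(u(k * x)))
        for k in ks for x in xs
    )

def implied_K(u, k, xa, xb):
    """The multiplier K(k) that Equation 2 would force, estimated from one pair of points."""
    return (u(k * xb) - u(k * xa)) / (u(xb) - u(xa))

ks = [0.5, 2.0, 10.0]                 # sample unit changes (assumed)
xs = [0.1, 1.0, 3.0, 20.0]            # sample values of x (assumed)
alpha, beta, delta = 1.7, 0.6, -4.0   # hypothetical parameter values

# Logarithmic solution: K(k) = 1 and C(k) = alpha * log k.
log_law = lambda x: alpha * math.log(x) + delta
log_ok = satisfies_equation_2(
    log_law, lambda k: 1.0, lambda k: alpha * math.log(k), ks, xs)

# Power solution: K(k) = k**beta and C(k) = delta * (1 - k**beta).
power_law = lambda x: alpha * x ** beta + delta
power_ok = satisfies_equation_2(
    power_law, lambda k: k ** beta, lambda k: delta * (1 - k ** beta), ks, xs)

# An exponential law would need a multiplier that changes with x, so it cannot satisfy Equation 2.
exp_ok = abs(implied_K(math.exp, 2.0, 0.1, 1.0) - implied_K(math.exp, 2.0, 1.0, 3.0)) < 1e-9

print(log_ok, power_ok, exp_ok)
```

The check on the exponential law compares the multiplier implied by two different pairs of points; a single K(k) free of x would make the two estimates agree.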


Theorem 3. If the independent con- 
tinuum is a ratio scale and the depend- 
ent continuum is a logarithmic interval 
scale, then either $u(x) = \delta e^{\alpha x^{\beta}}$, where $\alpha$ 
is independent of the unit of the de- 
pendent variable, $\beta$ is independent of the 
units of both variables and $\delta$ is inde- 
pendent of the unit of the independent 
variable, or $u(x) = \alpha x^{\beta}$, where $\beta$ is in- 
dependent of the units of both variables. 


Proof. Take the logarithm of Equa- 
tion 3 and let v = log u: 

\[ v(kx) = K^{*}(k) + C(k)\,v(x) \]

where $K^{*}(k) = \log K(k)$. By Theo- 
rem 2, either 

\[ v(x) = \alpha x^{\beta} + \delta^{*} \quad \text{or} \quad v(x) = \beta \log x + \alpha^{*} \]

Taking exponentials, either 

\[ u(x) = \delta e^{\alpha x^{\beta}} \quad \text{or} \quad u(x) = \alpha x^{\beta} \]

where $\delta = e^{\delta^{*}}$ and, in the second equa- 
tion, $\alpha = e^{\alpha^{*}}$. 


Theorem 4. If the independent con- 
tinuum is an interval scale, then it is 
impossible for the dependent continuum 
to be a ratio scale. 


Proof. Let c = 0 in Equation 4, then 
by Theorem 1 we know $u(x) = \alpha x^{\beta}$. 
Now set k = 1 and $c \neq 0$ in Equa- 
tion 4: 

\[ \alpha (x + c)^{\beta} = K(1,c)\,\alpha x^{\beta} \]

so 

\[ x + c = K(1,c)^{1/\beta}\, x \]

which implies x is a constant, con- 
trary to our assumption that both 
continua have more than one point. 


Theorem 5. If the independent and 
dependent continua are both interval 
scales, then $u(x) = \alpha x + \beta$, where $\beta$ is 
independent of the unit of the inde- 
pendent variable. 


Proof. If we let c = 0, then Equa- 
tion 5 reduces to Equation 2 and so 
Theorem 2 applies. If $u(x) = \alpha \log x + \beta$, 
then choosing k = 1 and $c \neq 0$ 
in Equation 5 yields 

\[ \alpha \log (x + c) + \beta = K(1,c)\,\alpha \log x + K(1,c)\,\beta + C(1,c) \]

By taking the derivative with respect 
to x, it is easy to see that x must be 
a constant, which is impossible. 

Thus, we must conclude that 
$u(x) = \alpha x^{\delta} + \beta$. Again, set k = 1 and 
$c \neq 0$, 

\[ \alpha (x + c)^{\delta} = K(1,c)\,\alpha x^{\delta} + K(1,c)\,\beta + C(1,c) \]

If $\delta \neq 1$, then differentiate with re- 
spect to x: 

\[ \alpha \delta (x + c)^{\delta - 1} = K(1,c)\,\alpha \delta\, x^{\delta - 1} \]

which implies x is a constant, so we 
must conclude $\delta = 1$. It is easy to see 
that $u(x) = \alpha x + \beta$ satisfies Equation 5. 

Theorem 6. If the independent con- 
tinuum is an interval scale and the 
dependent continuum is a logarithmic 
interval scale, then $u(x) = \alpha e^{\beta x}$, where 
$\alpha$ is independent of the unit of the inde- 
pendent variable and $\beta$ is independent 
of the unit of the dependent variable. 

Proof. Take the logarithm of Equa- 
tion 6 and let v = log u: 

\[ v(kx + c) = K^{*}(k,c) + C(k,c)\,v(x) \]

where $K^{*}(k,c) = \log K(k,c)$. By 
Theorem 5, 

\[ v(x) = \beta x + \alpha^{*} \]

so 

\[ u(x) = \alpha e^{\beta x} \]

where $\alpha = e^{\alpha^{*}}$. 


Theorem 7. If the independent con- 
tinuum is a logarithmic interval scale, 
then it is impossible for the dependent 
continuum to be a ratio scale. 


Proof. Let v(log x) = u(x), i.e., 
$v(y) = u(e^{y})$, then Equation 7 becomes 

\[ v(\log k + c \log x) = K(k,c)\,v(\log x) \]

Thus, log x is an interval scale and v is 
a ratio scale, which by Theorem 4 is 
impossible. 

Theorem 8. If the independent con- 
tinuum is a logarithmic interval scale 
and the dependent continuum is an in- 
terval scale, then $u(x) = \alpha \log x + \beta$, 
where $\alpha$ is independent of the unit of the 
independent variable. 


Proof. Let v(log x) = u(x), then 
Equation 8 becomes 

\[ v(\log k + c \log x) = K(k,c)\,v(\log x) + C(k,c) \]

so log x and v are both interval scales. 
By Theorem 5, 

\[ u(x) = v(\log x) = \alpha \log x + \beta \]

Theorem 9. If the independent and 
dependent continua are both logarithmic 
interval scales, then $u(x) = \alpha x^{\beta}$, where 
$\beta$ is independent of the units of both the 
independent and dependent variables. 

Proof. Take the logarithm of Equa- 
tion 9 and let v = log u: 

\[ v(kx^{c}) = K^{*}(k,c) + C(k,c)\,v(x) \]

where $K^{*}(k,c) = \log K(k,c)$. By 
Theorem 8, 

\[ v(x) = \beta \log x + \alpha^{*} \]

so 

\[ u(x) = e^{\beta \log x + \alpha^{*}} = \alpha x^{\beta} \]

where $\alpha = e^{\alpha^{*}}$. 


ILLUSTRATIONS 


It may be useful, prior to discussing 
these results, to cite a few familiar 
laws that accord with some of them. 
The best source of examples is classi- 
cal physics, where most of the funda- 
mental variables are idealized as con- 
tinua that form either ratio or interval 
scales. No attempt will be made to 
illustrate the results concerning loga- 
rithmic interval scales, because no 
actual use of scales of this type seems 
to have been made. 

The variables entering into Coulomb's 
law, Ohm's law, and Newton's 
gravitation law are all ratio scales, and 
in each case the form of the law is a 
power function, as called for by Theo- 
rem 1. Additional examples of Theo- 
rem 1 can be found in geometry since 
length, area, and volume are ratio 
scales; thus the dependency of the 
volume of a sphere upon its radius or 
of the area of a square on its side are 
illustrations. 

Other important variables such as 
energy and entropy form interval 
scales, and we can therefore anticipate 
that as dependent variables they will 
illustrate Theorem 2. If a body of 
constant mass is moving at velocity v, 
then its energy is of the form $\alpha v^{2} + \delta$. 
If the temperature of a perfect gas is 
constant, then as a function of pres- 
sure p the entropy of the gas is of the 
form $\alpha \log p + \beta$. No examples, of 
course, are possible for Theorem 4. 

As an example of Theorem 5 we 
may consider ordinary temperature, 
which is frequently measured in terms 
of the length of a column of mercury. 
Although length as a measure forms a 
ratio scale, the length of a column of 
mercury used to measure temperature 
is an interval scale (subject to the 
added constraint that the length is 
positive), since we may choose any 
initial length to correspond to a given 
temperature, such as the freezing 
point of water. If the temperature 
scale is also an interval scale, as is 
usually assumed, then the only rela- 
tion possible according to Theorem 5 
is the linear one. 


DISCUSSION 


Some with whom I have discussed 
these theorems—which from a mathe- 
matical point of view are not new— 
have had strong misgivings about 
their interpretation ; the feeling is that 
something of a substantive nature 
must have been smuggled into the 
formulation of the problem. They 
argue that practically any functional 
relation can hold between two vari- 
ables and that it is an empirical, not 
a theoretical, matter to ascertain what 
the function may be in specific cases. 
To support this view and to challenge 
the theorems, they have cited examples 
from physics, such as the exponential 
law of radioactive decay or 
some sinusoidal function of time, which 
seem to violate the theorems stated 
above. We must, therefore, examine 
the ways in which these examples by- 
pass the rather strong conclusions of 
the present theory. 

All physical examples which have 
been suggested to me as counter- 
examples to the theorems have a 
common form: the independent vari- 
able is a ratio scale, but it enters into 
the equation in a dimensionless fash- 
ion. For example, some identifiable 
value of the variable is taken as the 
reference level $x_0$, and all other values 
are expressed in reference to it as $x/x_0$. 
The effect of this is to make the quan- 
tity $x/x_0$ independent of the unit used 
to measure the variable, since 
$kx/kx_0 = x/x_0$. In periodic functions of 
time, the period is often used as a 
reference level. Slightly more gen- 
erally, the independent variable only 
appears multiplied by a constant c 
whose units are the inverse of those 
of x. Thus, whenever the unit of x 
is changed by multiplying all values 
by a constant k > 0, it is necessary to 
adjust the unit of c by multiplying it 
by 1/k. But this means that the 
product is independent of k: (c/k) (kx) 
= cx. The time constant in the law 
of radioactive decay is of this nature. 
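
The cancellation just described can be made concrete with a short sketch. The Python code below is an added illustration; the decay constant, the elapsed time, and the seconds-to-minutes conversion are hypothetical values chosen for the example. It shows that an exponential decay law gives the same value in either unit provided the constant is rescaled by 1/k, and a different value if it is not.

```python
import math

def decay(t, c):
    """Exponential decay written in terms of the dimensionless product c*t."""
    return math.exp(-c * t)

t_seconds = 90.0        # hypothetical elapsed time, in seconds
c_per_second = 0.02     # hypothetical decay constant, in 1/seconds

k = 1.0 / 60.0          # change of unit: seconds -> minutes multiplies t by k
t_minutes = k * t_seconds
c_per_minute = c_per_second / k   # the constant carries inverse units, so it is multiplied by 1/k

# With the constant rescaled, the product c*t and hence the law's value are unchanged.
same_value = math.isclose(decay(t_seconds, c_per_second),
                          decay(t_minutes, c_per_minute))

# If the constant is NOT rescaled, the numerical form of the law changes with the unit.
changed_value = not math.isclose(decay(t_seconds, c_per_second),
                                 decay(t_minutes, c_per_second))

print(same_value, changed_value)
```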

There are two ways to view these 
examples in relation to the principle 
stated above. If the ratio scale x is 
taken to be the independent variable, 
then the invariance part of the prin- 
ciple is not satisfied by these laws. If, 
however, for the purpose of the law 
under consideration the dimensionless 
quantity cx is treated as the variable, 
then no violation has occurred. Although 
surprising at first glance, it is 
easy to see that the principle imposes 
no limitations upon the form of the 
law when the independent variable is 
dimensionless, i.e., when no trans- 
formations save the identity are ad- 
missible. 

We are thus led to the following con- 
clusion. Either the independent vari- 
able is a ratio scale that is multiplied 
by a dimensional constant that makes 
the product independent of the unit of 
the scale, in which case there is no re- 
striction upon the laws into which it 
may enter, or the independent vari- 
able is not rendered dimensionless, in 
which case the laws must be of the 


form described by the above theorems. 
Both situations are found in classical 
physics, and one wonders if there is 
any fundamental difference between 
them. I have not seen any discussion 
of the matter, and I have only the 
most uncertain impression that there 
is a difference. In many physical situa- 
tions where a dimensional constant 
multiplies the independent variable, 
the dependent variable is bounded. 
This is true of both the decay and 
periodic laws. Usually, the constant 
is expressed in some natural way in 
terms of the bounds, as, for example, 
the period of a periodic function. 
Whether dimensional constants can 
legitimately be used in other situa- 
tions, or whether they can always be 
eliminated, is not at all apparent to 
me. 

One may legitimately question which 
of these alternatives is applicable to 
psychophysics, and the answer is far 
from clear. The widespread use of, 
say, the threshold as a reference level 
seems at first to suggest that psycho- 
physical laws are to be expressed in 
terms of dimensionless quantities; 
however, the fact that this is done 
mainly to present results in decibels 
may mean no more than that the 
given ratio scale is being transformed 
into an interval scale in accordance 


with Theorem 2: 


\[ y = \alpha \log (x/x_0) = \alpha \log x + \beta \]

where $\beta = -\alpha \log x_0$. 

In addition to dimensionless vari- 
ables as a means of by-passing the re- 
strictions imposed by scale types, 
three other possibilities deserve dis- 
cussion. 

First, the idealization that the scales 
form mathematical continua and that 
they are related by a continuous function 
may not reflect the actual state 
of affairs in the empirical world. It 
is certainly true that, in detail, physi- 
cal continua are not mathematical 
continua, and there is ample reason 
to suspect that the same holds for 
psychological variables. But the as- 
sumptions that stimuli and responses 
both form continua are idealizations 
that are difficult to give up; to do 
so would mean casting out much 
of psychophysical theory. Alterna- 
tively, we could drop the demand that 
the function relating them be con- 
tinuous, but it is doubtful if this 
would be of much help by itself. The 
discontinuous solutions to, say, Equa- 
tion 1 are manifold and extremely wild 
in their behavior. They are so wild 
that it is difficult to say anything pre- 
cise about them at all (see Hamel, 
1905; Jones: 1942a, 1942b), and it is 
doubtful that such solutions represent 
empirical laws. 

Second, casual observation suggests 
that it might be appropriate to assume 
that at least the dependent variable is 
bounded, e.g., that there is a psycho- 
logically maximum loudness.  Al- 
though plausible, boundedness cannot 
be imposed by itself since, as is shown 
in the theorems, all the continuous 
solutions to the appropriate functional 
equations are unbounded if the func- 
tions are increasing, as they must be 
for empirical reasons. It seems clear 
that boundedness of the dependent 
variable is intimately tied up either 
with introducing a reference level so 
that the independent variable is an 
absolute scale or with some discon- 
tinuity in the formulation of the problem, 
possibly in the nature of the 
variables or possibly in the function 
relating them. Actually, one can es- 
tablish that it must be in the nature 
of the variables. Suppose, on the 
contrary, that the variables are ratio 
scales that form numerical continua 


and that they are related by a func- 
tion u that is nonnegative, noncon- 
stant, and monotonic increasing, but 
not necessarily continuous. We now 
need only show that u cannot be 
bounded to show that the discon- 
tinuity must exist in the variable. 
Suppose, therefore, that it is bounded 
and that the bound is M. By Equa- 
tion 1, $u(kx) = K(k)u(x) \leq M$, so 
$u(x) \leq M/K(k)$. For k > 1, the monotonicity 
of u implies that 
$u(x) \leq u(kx) = K(k)u(x)$, so choosing u(x) > 0 
we see that $K(k) \geq 1$. If for 
some k > 1, K(k) > 1, then K can be 
made arbitrarily large since, for any 
integer n, $K(k^{n}) = K(k)^{n}$, but since 
$u(x) \leq M/K(k^{n})$, this implies $u \equiv 0$, con- 
trary to assumption. Thus, for all 
k > 1, K(k) = 1, which by Equation 
1, means u(kx) = u(x), for all x and 
k > 1. This in turn implies u is a 
constant, which again is contrary to 
assumption. Thus, we have estab- 
lished our claim that some discon- 
tinuity must reside in the nature of 
the variables. 

Third, in many situations, there are 
two or more independent variables; 
for example, both intensity and fre- 
quency determine loudness. Usually 
we hold all but one variable constant 
in our empirical investigations, but 
the fact remains that the others are 
there and that their presence may 
make some difference in the total 
range of possible laws. For example, 
suppose there are two independent 
variables, x and y, both of which 
form ratio scales and that the depend-
ent variable u is also a ratio scale,
then the analogue of Equation 1 is 


u(kx,hy) = K(k,h)u(x,y) 


where k > 0, h > 0, and K(k,h) > 0. 
We know by Theorem 1 that if we
hold one variable, say y, fixed at some
value and let h = 1, then the solution
must be of the form

$$u(x,y) = \alpha(y) x^{\beta(y)}.$$

But holding x constant and letting
k = 1, we also know that it must be
of the form

$$u(x,y) = \delta(x) y^{\gamma(x)}.$$

Thus,

$$\alpha(y) x^{\beta(y)} = \delta(x) y^{\gamma(x)}.$$

If we restrict ourselves to u's having
partial derivatives of both variables,
this equation can be shown (see Section
2.C.2 of Luce [in press]) to have
solutions only of the form:

$$u(x,y) = \alpha x^{\beta} y^{\gamma} e^{\delta \log x \log y}.$$

Thus, the principle again severely re- 
stricts the possible laws, even when we 
admit more than one independent 
variable.‘ 
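
The restriction can be checked numerically. The following is a minimal sketch, assuming arbitrary illustrative constants (the names a, b, c, d and their values are not from the text): with y held fixed, rescaling x by k should multiply u by a factor that does not depend on x, as Theorem 1 requires of a one-variable law.

```python
import math

def u(x, y, a=2.0, b=0.5, c=1.5, d=0.1):
    """Candidate law u(x,y) = a * x**b * y**c * exp(d * ln x * ln y).

    The constants a, b, c, d are illustrative assumptions only.
    """
    return a * x**b * y**c * math.exp(d * math.log(x) * math.log(y))

def scale_factor_x(k, x, y):
    """Ratio u(kx, y) / u(x, y): the effect of changing the unit of x by k."""
    return u(k * x, y) / u(x, y)

k, y = 3.0, 2.0
ratios = [scale_factor_x(k, x, y) for x in (0.5, 1.0, 7.0, 40.0)]

# With y held fixed, the factor should be the same at every x ...
spread = max(ratios) - min(ratios)
# ... and, for the default b = 0.5 and d = 0.1, equal to k**(b + d * ln y).
expected = k ** (0.5 + 0.1 * math.log(y))
print(ratios, expected, spread)
```

The factor depends on the fixed value of y, which is why the one-variable theorem is applied separately to each variable in this argument.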

It must be emphasized that the 
remark in Footnote 3 does not apply 
here. If a function that depends upon
one independent variable is added to 


the other, e.g.,
u(y) = ex + YP 


then wholly new solution possibilities
exist (see Section 2.C.3 of Luce [in press]).

In sum, there appear to be two ways 


around the restrictions set forth in the 
theorems. The first can be viewed 
either as a rejection of Part 2 of the 
principle or as the creation of a dimen- 
sionless independent variable from a 
ratio scale; it involves the presence of 
dimensional constants that cancel out 


4 The use of this argument to arrive at
the form of u(x,y) seems much more satis- 
factory and convincing than the heuristic 
development given in Section 2.C of Luce (in 
press), and the empirical suggestions given 
there should gain correspondingly in interest 
as a result of the present work. 


the dimensions of the independent 
variables. This appears to be par- 
ticularly appropriate if the dependent 
variable has a true, well-defined bound.
The second is to reject the idealiza-
tion of the variables as numerical con-
tinua and, possibly, to assume that
they are bounded. 

On the other hand, if the theorems 
are applicable, then the possible psy- 
chophysical (and other) laws become 
severely limited. Indeed, they are so 
limited that one can argue that the 
important question is not to deter- 
mine the forms of the laws, but rather 
to create empirically testable measure- 
ment theories for the several psycho- 
physical methods in order that we may 
know for certain what types of scales
are being obtained. Once this is 
known, the form of the psychophysical 
functions is determined except for 
some numerical constants. In the 
meantime, however, experimental de- 
terminations of the form of the psy- 
chophysical functions by methods for 
which no measurement theories exist 
provide at least indirect evidence of
the type of scale being obtained. For 
example, the magnitude methods seem 
to result in power functions, which 
suggests that the psychological meas- 
ure is either a ratio or logarithmic in- 
terval scale, not an interval scale. 
Since the results from cross-modality 
matchings tend to eliminate the loga- 
rithmic interval scale as a possibility, 
there is presumptive evidence that 
these methods yield ratio scales, as 
Stevens has claimed. 


SUMMARY 


The following problem was con- 
sidered. What are the possible forms 
of a substantive theory that relates a 
dependent variable in a continuous 
manner to an independent variable? 
Each variable is idealized as a nu- 
TABLE 2

THE POSSIBLE LAWS SATISFYING THE PRINCIPLE OF THEORY CONSTRUCTION

Scale Types
Independent Variable   Dependent Variable   Possible Laws            Comments*
ratio                  ratio                u(x) = αx^β              β/x; β/u
ratio                  interval             u(x) = α log x + β       α/x
                                            u(x) = αx^β + δ          β/x; β/u; δ/x
ratio                  log interval         u(x) = δe^(αx^β)         α/u; β/x; β/u; δ/x
                                            u(x) = αx^β              β/x; β/u
interval               ratio                impossible
interval               interval             u(x) = αx + β            β/x
interval               log interval         u(x) = αe^(βx)           α/x; β/u
log interval           ratio                impossible
log interval           interval             u(x) = α log x + β       α/x
log interval           log interval         u(x) = αx^β              β/x; β/u

* The notation α/x means "α is independent of the unit of x."
MULTIVARIATE INFORMATION TRANSMISSION*†


WILLIAM J. McGILL

A multivariate analysis based on transmitted information is presented.
It is shown that sample transmitted information provides a simple method
for measuring and testing association in multi-dimensional contingency
tables. Relations with analysis of variance are pointed out, and statistical tests 
are described. 


Several recent articles in the psychological journals have shown how 
ideas derived from communication theory are being applied in psychology. 
It is not widely understood, however, that the tools made available by 
communication theory are useful for analyzing data whether or not we 
believe the human organism is best described as a communications system. 
This paper will present an extension of Shannon’s (10) measure of trans- 
mitted information. It will be shown that transmitted information leads 


to a simple multivariate analysis of contingency data, and to appropriate 
statistical tests. 


1. Basic Definitions 


Let us consider a communication channel and its input and output. 
Transmitted information measures the amount of association between the 
input and output of the channel. If input and output are perfectly correlated, 
all the input information is transmitted. On the other hand, if input and 
output are independent, no information is transmitted. Naturally most 
cases of information transmission are found between these extremes. There is
some uncertainty at the receiver about what was sent. Some information is
transmitted and some does not get through.

We are interested not in what the transmitted information is, but in 
the amount of information transmitted. Suppose that we have a discrete 
input variable, x, and a discrete output variable, y. Since x is discrete, it
takes on values or signals k = 1, 2, 3, ..., X with probabilities indicated
by p(k). Similarly, y assumes values m = 1, 2, 3, ..., Y with probabilities
p(m). If it happens that k is sent and m is received, we can speak of the
joint input-output event (k,m). This joint event has probability p(k,m).


*This work was supported in part by the Air Force Human Factors Operations
Research Laboratories, and in part jointly by the Army, Navy, and Air Force under
contract with the Massachusetts Institute of Technology.
†Several of the indices and tests discussed in this paper have been developed
independently by J. E. Keith Smith (11) at the University of Michigan, and by
W. R. Garner at Johns Hopkins University.


This article appeared in Psychometrika, 1954, 19, 97-116. Reprinted with permission. 
The rules governing the selection of signals at either end of the channel must
be constructed so that

$$\sum_k p(k) = \sum_m p(m) = \sum_{k,m} p(k,m) = 1.$$


Under these conditions, assuming successive signals are independent, the 
amount of information transmitted in “bits” per signal is defined as 
$$T(x;y) = H(x) + H(y) - H(x,y), \tag{1}$$

where

$$H(x) = -\sum_k p(k) \log_2 p(k),$$
$$H(y) = -\sum_m p(m) \log_2 p(m),$$
$$H(x,y) = -\sum_{k,m} p(k,m) \log_2 p(k,m).$$


One “bit” is equal to $-\log_2(1/2)$ and represents the information conveyed by
a choice between two equally probable alternatives. Our development will use 
the bit as a unit, since this is the convention in information theory, but 
any convenient unit may be substituted by changing the base of the logarithm. 

If there is a relation between x and y, H(x) + H(y) > H(x,y) and
the size of the inequality is just T(x;y). On the other hand, if x and y are
independent, H(x,y) = H(x) + H(y) and T(x;y) is zero. It can be shown
that T(x;y) is never negative.
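
The two extremes described in this section can be reproduced with a short computation. The sketch below assumes three small made-up joint probability tables (they are illustrations, not data from the paper) and evaluates equation (1) for each.

```python
from math import log2

def entropy(probs):
    """H = -sum p log2 p, skipping zero-probability cells."""
    return -sum(p * log2(p) for p in probs if p > 0)

def transmitted_information(joint):
    """T(x;y) = H(x) + H(y) - H(x,y) for a table joint[k][m] = p(k,m)."""
    p_x = [sum(row) for row in joint]           # input probabilities p(k)
    p_y = [sum(col) for col in zip(*joint)]     # output probabilities p(m)
    p_xy = [p for row in joint for p in row]    # joint probabilities p(k,m)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

perfect = [[0.5, 0.0], [0.0, 0.5]]          # output always equals input
independent = [[0.25, 0.25], [0.25, 0.25]]  # p(k,m) = p(k) p(m)
noisy = [[0.4, 0.1], [0.1, 0.4]]            # between the two extremes

t_perfect = transmitted_information(perfect)
t_independent = transmitted_information(independent)
t_noisy = transmitted_information(noisy)
print(t_perfect, t_independent, t_noisy)
```

With two equally likely signals, perfect transmission carries the full bit of input information, independence carries none, and the noisy channel falls in between.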

The presentation to this point has been an outline of the properties of 
the measure of transmitted information as set forth by Shannon (10). These 
properties may be summarized by stating that the amount of information 
transmitted is a bivariate, positive quantity that measures the association 
between input and output of a channel. There are, however, very few restric- 
tions on how a channel may be defined. The input-output relations that 
occur in many psychological contexts are certainly possible channels. Con- 
sequently we can measure transmitted information in these contexts and 


anticipate that the results will be interesting. 


2. Sample Information 
Our development will be based on sample measures of information, i.e., 
on measures of information constructed from relative frequencies. 
Suppose that we make n observations of events (k,m). We identify
$n_{km}$ as the number of times that k was sent and m was received. This means
that

$$n_k = \sum_m n_{km},$$
$$n_m = \sum_k n_{km},$$
$$n = \sum_{k,m} n_{km},$$

where $n_k$ is the number of times that k was sent, $n_m$ is the number of times
that m was received, and n is the total number of observations. A particular
experiment can then be represented by a contingency table with XY cells
and entries $n_{km}$.

We may estimate the probabilities, p(k), p(m), and p(k,m) with $n_k/n$,
$n_m/n$, and $n_{km}/n$, respectively. Sample transmitted information, T'(x;y), is
defined as

$$T'(x;y) = H'(x) + H'(y) - H'(x,y), \tag{2}$$

where H'(x), H'(y) and H'(x,y) are constructed from relative frequencies
instead of from probabilities. [Throughout the paper a prime is used over a
quantity to indicate the maximum likelihood estimator of the same quantity
without the prime, e.g., T'(x;y) is an estimator for T(x;y).] As before, T'(x;y)
is the amount of transmitted information (in the sample) measured in “bits”
per signal.

Since it is difficult to manipulate logs of relative frequencies, we will 
introduce an easier notation: 


$$s_{km} = \frac{1}{n} \sum_{k,m} n_{km} \log_2 n_{km},$$
$$s_k = \frac{1}{n} \sum_k n_k \log_2 n_k,$$
$$s_m = \frac{1}{n} \sum_m n_m \log_2 n_m,$$
$$s = \log_2 n.$$

Expressions involving sample measures of information are easier to
handle in this notation. For example, T'(x;y) becomes

$$T'(x;y) = s - s_k - s_m + s_{km}. \tag{3}$$

Equations (2) and (3) are equivalent expressions for T'(x;y). When
we write equations like (3), we shall say that these equations are written in
s-notation. Thus (3) is (2) in s-notation.
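
The equivalence of (2) and (3) is easy to confirm on any table of counts. The following sketch uses a made-up 3 X 3 contingency table (an assumption for illustration) and computes the sample transmitted information both ways.

```python
from math import log2

def s_term(counts, n):
    """(1/n) * sum of c * log2(c) over a list of counts (0 log 0 taken as 0)."""
    return sum(c * log2(c) for c in counts if c > 0) / n

def t_sample_s_notation(table):
    """T'(x;y) = s - s_k - s_m + s_km for counts table[k][m], as in (3)."""
    n = sum(sum(row) for row in table)
    s = log2(n)
    s_k = s_term([sum(row) for row in table], n)
    s_m = s_term([sum(col) for col in zip(*table)], n)
    s_km = s_term([c for row in table for c in row], n)
    return s - s_k - s_m + s_km

def t_sample_entropies(table):
    """The same quantity from H'(x) + H'(y) - H'(x,y), as in (2)."""
    n = sum(sum(row) for row in table)

    def h(counts):
        # Entropy of the relative frequencies c/n.
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    return (h([sum(row) for row in table])
            + h([sum(col) for col in zip(*table)])
            - h([c for row in table for c in row]))

table = [[12, 3, 1], [4, 10, 2], [0, 5, 13]]  # made-up counts
via_s = t_sample_s_notation(table)
via_h = t_sample_entropies(table)
print(via_s, via_h)
```

The s-notation version works directly on integer counts, which is the practical advantage the notation is introduced for.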


3. Three-Dimensional Transmitted Information 


Now let us extend the definition of transmitted information to include 
two sources, u and v, that transmit to y. To accomplish this we replace x
in equation (2) with u,v and we find that

$$T'(u,v;y) = H'(u,v) + H'(y) - H'(u,v,y), \tag{4}$$

where x has been subdivided into two classes, u and v. The possible values
of u are i = 1, 2, 3, ..., U, while v assumes values j = 1, 2, 3, ..., V. The
subdivision is arranged so that the range of values of u and v jointly constitute
the possible values of x. This means that the input event, k, can be replaced
by the joint input event (i,j). Consequently we have

$$n_k = n_{ij},$$


and the direct substitution of u,v for x in (2) is legitimate. 

Our new term, T'(u,v;y), measures the amount of information trans- 
mitted when u and v transmit to y. It is evident, however, that the direction 
of transmission is irrelevant, for examination of (4) reveals that 


$$T'(u,v;y) = T'(y;u,v).$$


This means that nothing is gained formally by distinguishing transmitters 
from receivers. The amount of information transmitted is a measure of 
association between variables. It does not respect the direction in which 


the information is travelling. On the other hand, we cannot permute symbols 


at will, for

$$T'(u,y;v) = H'(u,y) + H'(v) - H'(u,v,y),$$

and this is not necessarily equal to T'(u,v;y).
Our aim now is to measure T'(u,v;y) and then to express T'(u,v;y) as 


a function of the bivariate transmissions between u and y, and v and y. 
Computation of T'(u,v;y) is not difficult. Our observations of the joint
event (i,j,m) organize themselves into a three-dimensional contingency table
with UVY cells and entries $n_{ijm}$. We can compute the quantities in (4) from
this table, or we can write

$$T'(u,v;y) = s - s_m - s_{ij} + s_{ijm}, \tag{5}$$

where

$$s_{ijm} = \frac{1}{n} \sum_{i,j,m} n_{ijm} \log_2 n_{ijm},$$

and the other s-terms are defined by analogy with the s-terms in equation (3).
Now suppose we want to study transmission between u and y. We
may eliminate v in two ways. First let us reduce the three-dimensional
contingency table to two dimensions by summing over v. The entries in the
reduced table are

$$n_{im} = \sum_j n_{ijm}.$$

We have for the transmitted information between u and y,

$$T'(u;y) = s - s_i - s_m + s_{im}. \tag{6}$$


The second way to eliminate v is to compute the transmission between u and
y separately for each value of v and then average these together. This transmitted
information will be called $T'_v(u;y)$, where

$$T'_v(u;y) = \sum_j \frac{n_j}{n} [T'_j(u;y)], \tag{7}$$

and $T'_j(u;y)$ is information transmitted between u and y for a single value
of v, namely j. It is readily shown that

$$T'_v(u;y) = s_j - s_{ij} - s_{jm} + s_{ijm}. \tag{8}$$

We see that $T'_v(u;y)$ is written in the same way as T'(u;y) except that the
subscript j is added to each of the s-terms.

There are three different pairs of variables in a three-dimensional con- 
tingency table. For example, the two equations for transmission between 
v and y are written 


$$T'(v;y) = s - s_j - s_m + s_{jm}, \tag{9}$$
$$T'_u(v;y) = s_i - s_{ij} - s_{im} + s_{ijm}. \tag{10}$$

Finally we may study transmission between u and v, i.e.,

$$T'(u;v) = s - s_i - s_j + s_{ij}, \tag{11}$$
$$T'_y(u;v) = s_m - s_{im} - s_{jm} + s_{ijm}. \tag{12}$$


With these results in mind let us reconsider the information transmitted 
between u and y. If v has an effect on transmission between u and y, then
$T'_v(u;y) \neq T'(u;y)$. One way to measure the size of the effect is by

$$A'(uvy) = T'_v(u;y) - T'(u;y),$$
$$A'(uvy) = -s + s_i + s_j + s_m - s_{ij} - s_{im} - s_{jm} + s_{ijm}. \tag{13}$$

A few more substitutions will show that

$$A'(uvy) = T'_v(u;y) - T'(u;y) = T'_u(v;y) - T'(v;y) = T'_y(u;v) - T'(u;v). \tag{14}$$

In view of this symmetry, we may call A'(uvy) the u-v-y interaction information.
We see that A'(uvy) is the gain (or loss) in sample information transmitted
between any two of the variables, due to additional knowledge of the
third variable.


Now we can express the three-dimensional information transmitted
from u,v to y, i.e., T'(u,v;y), as a function of its bivariate components, for

$$T'(u,v;y) = T'(u;y) + T'(v;y) + A'(uvy), \tag{15}$$
$$T'(u,v;y) = T'_v(u;y) + T'_u(v;y) - A'(uvy). \tag{16}$$

Equations (15) and (16) taken together mean that T'(u,v;y) can be represented
by a diagram with overlapping circles as shown in Figure 1. The diagram
assumes what we shall call “positive” interaction between u,v and y. Interaction
is positive when the effect of holding one of the interacting variables
constant is to increase the amount of association between the other two.
This means that $T'_v(u;y) > T'(u;y)$ and $T'_u(v;y) > T'(v;y)$. [Because of (14),
if one of these inequalities holds, both must hold.] Later on, however, we
shall show that interaction may be negative. When this happens, relations
between the interacting variables are reversed, and the diagram in Figure 1
is no longer strictly correct.

FIGURE 1
Schematic diagram of the components of three-dimensional transmitted
information. The diagram shows that three-dimensional transmission can be
analyzed into a pair of bivariate transmissions plus an interaction term.
The meanings of the symbols are explained in the text.
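
The symmetry of the interaction term and the two decompositions of three-dimensional transmission can be checked on arbitrary data. The sketch below is an illustration only: it assumes a randomly generated table of 200 observations and a helper `s_term` (a hypothetical name) that forms the s-term for any margin of the table.

```python
from collections import Counter
from math import log2
import random

def s_term(table, axes):
    """s-term for the margin over `axes`: (1/n) * sum of c * log2(c).

    With axes = () the margin is the grand total, so the result is log2 n.
    """
    n = sum(table.values())
    margin = Counter()
    for cell, count in table.items():
        margin[tuple(cell[a] for a in axes)] += count
    return sum(c * log2(c) for c in margin.values() if c > 0) / n

def components(table):
    """Bivariate, conditional, and interaction terms for cells (i, j, m)."""
    I, J, M = 0, 1, 2                      # positions of u, v, y in a cell
    s = s_term(table, ())
    si, sj, sm = (s_term(table, (a,)) for a in (I, J, M))
    sij = s_term(table, (I, J))
    sim = s_term(table, (I, M))
    sjm = s_term(table, (J, M))
    sijm = s_term(table, (I, J, M))
    return {
        "T(u;y)": s - si - sm + sim,             # equation (6)
        "T(v;y)": s - sj - sm + sjm,             # equation (9)
        "T(u;v)": s - si - sj + sij,             # equation (11)
        "Tv(u;y)": sj - sij - sjm + sijm,        # equation (8)
        "Tu(v;y)": si - sij - sim + sijm,        # equation (10)
        "Ty(u;v)": sm - sim - sjm + sijm,        # equation (12)
        "T(u,v;y)": s - sm - sij + sijm,         # equation (5)
        "A": -s + si + sj + sm - sij - sim - sjm + sijm,  # equation (13)
    }

random.seed(0)
table = Counter((random.randrange(3), random.randrange(4), random.randrange(2))
                for _ in range(200))
c = components(table)
print(c)
```

Because every term is a signed sum of the same eight s-terms, the identities hold exactly for any table, not only in expectation.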


4. Components of Response Information 


The multivariate model of information transmission is useful to us 
because the situations treated by communication theory are not the same as 
those we deal with in psychological applications. The engineer is usually able 
to restrict himself to transmission from a single information source. He 
knows the statistical properties of the source, and when he speaks of noise he 
means random noise. This kind of precision is seldom available to us. In our 
experiments we generally do not know in advance how many sources are 


transmitting information. We must therefore be careful not to confuse 


statistical noise with the experimenter’s ignorance. 
The bivariate model of transmitted information provided by communi- 


cation theory tells us to attribute to random noise whatever uncertainty there 
is in specifying the response when the stimulus is known (10). Consequently,
if several sources transmit information to responses, the bivariate model
will certainly fail to discriminate effects due to uncontrolled sources from
those due to random variability. On the other hand, the multivariate model
can measure the effects due to the various transmitting sources. For example,
in three-dimensional transmission we find that 


$$H'(y) = H'_{uv}(y) + T'(u;y) + T'(v;y) + A'(uvy), \tag{17}$$

where $H'(y) = s - s_m$ and $H'_{uv}(y) = s_{ij} - s_{ijm}$.
We see that H'(y), the response information, has been analyzed into
an error term plus a set of correlation terms due to the input variables. The
error term, $H'_{uv}(y)$, is the residual or unexplained variability in the output,
y, after the information due to the inputs, u and v, has been removed. In


bivariate information transmission, the response information is analyzed less 
precisely. For example, we may have 


$$H'(y) = H'_u(y) + T'(u;y). \tag{18}$$

In this case the error term is $H'_u(y)$ because only one input, u, is recorded.
Shannon (10) showed that

$$H'_u(y) \geq H'_{uv}(y).$$

In other words the error term, when only u is controlled, cannot be increased
if we also control v. In fact

$$H'_u(y) = H'_{uv}(y) + T'_u(v;y). \tag{19}$$


Equation (19) is proved by expanding both sides in s-notation. Thus 
if u and v are stimulus variables that transmit information via responses, y,
we have an error term, $H'_u(y)$, provided we keep track of only one of the
inputs, namely, u. However, this error term contains a still smaller error 
term as well as the information transmitted from v. Controlling v is thus 
seen to be equivalent to extracting the association between v and y from the 
noise. Multivariate transmitted information is essentially information 
analyzed from the noise part of bivariate transmission. 
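
The shrinking of the error term as more inputs are recorded can be illustrated by simulation. The sketch below assumes a made-up response rule (the response follows the two inputs on about 60 per cent of trials and is random otherwise); neither the rule nor the sample size comes from the paper. Each error term is computed as the entropy of y within each class of the recorded inputs, weighted by class frequency.

```python
from collections import Counter, defaultdict
from math import log2
import random

def entropy(counts):
    """Sample entropy in bits of a list of counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def error_term(samples, given):
    """Entropy of y within each class of the `given` inputs, weighted by n_class / n.

    given = () returns H'(y); ("u",) returns H'_u(y); ("u", "v") returns H'_uv(y).
    """
    groups = defaultdict(Counter)
    for u, v, y in samples:
        key = tuple({"u": u, "v": v}[g] for g in given)
        groups[key][y] += 1
    n = len(samples)
    return sum(sum(ys.values()) / n * entropy(list(ys.values()))
               for ys in groups.values())

random.seed(1)
samples = []
for _ in range(500):
    u, v = random.randrange(3), random.randrange(2)
    # Assumed response rule: follow the inputs most of the time, else guess.
    y = (u + v) % 3 if random.random() < 0.6 else random.randrange(3)
    samples.append((u, v, y))

h_y = error_term(samples, given=())            # response information H'(y)
h_u = error_term(samples, given=("u",))        # error term with u recorded
h_uv = error_term(samples, given=("u", "v"))   # error term with u and v recorded
t_u_y = h_y - h_u            # equation (18) rearranged
t_v_y_given_u = h_u - h_uv   # equation (19) rearranged
print(h_y, h_u, h_uv, t_u_y, t_v_y_given_u)
```

The difference between successive error terms is exactly the information extracted from the noise by recording the additional input.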


5. An Example

The kind of analysis that multivariate information transmission yields 
can be illustrated by a set of data obtained from one subject in an experiment 
on frequency judgment.

Four equally loud tones, 890, 925, 970, and 1005 cycles per second 
were presented to the subject one at a time in random order. Each tone was
2 Second long and separated by about 3 seconds from the next tone. During 
preliminary training the subject learned to identify the tones by pairing them
with four response keys. In experimental sessions, a loud masking noise was
turned on and a random sequence of 250 tones was presented against the
noise background. A flashing light told the subject when the stimulus occurred,
and he was instructed to guess if in doubt about which one of the four tones
it was.

One object of the experiment was to find weights for both the frequency 
stimulus and the immediately preceding response in determining which key 
the subject would press. Tests were run at several signal-to-noise ratios. 
The data presented here were obtained when the signal-to-noise ratio was 
close to the masked threshold. 

In order to calculate weights, we can consider the experiment as an 
example of three-dimensional transmission. Our analysis is based on the 
responses to the 125 even-numbered stimuli. The odd-numbered responses are 
considered as the context in which the subject judged the even-numbered 
stimuli. The odd-numbered stimuli are ignored in this analysis. 

The stimuli will be designated as the variable u. Last previous responses 
are called “presponses’”’ and they will be indicated by the variable v. These 
are the inputs. Current responses are represented by y. This is the output 
variable. Thus we can identify the joint event (i,j,m) as the occurrence of
response m to stimulus i, following presponse j. Failure to respond is
considered as a possible response. Consequently there are four stimulus
categories and five response categories.

[TABLE 1: Stimulus-Response Frequency Table (rows: response; columns: stimulus)]
[TABLE 2: Presponse-Response Frequency Table (rows: response; columns: presponse)]

The subject’s responses to the 125 test stimuli were sorted into a
4 X 5 X 5 contingency table. Two of the reduced tables that were obtained
from this master table are reproduced here in order to illustrate our
computations. For example, the Stimulus-Response plot in Table 1 has entries
$n_{im}$. The calculation for $s_{im}$ goes as follows:


$$s_{im} = \frac{1}{125} [1 \log_2 1 + 5 \log_2 5 + 12 \log_2 12 + \cdots + 7 \log_2 7 + 10 \log_2 10],$$
$$s_{im} = 374.05750/125,$$
$$s_{im} = 2.99246.$$


In the same way, $s_{jm}$ is computed from the figures for $n_{jm}$ in the
Presponse-Response table, Table 2:

sin = 135 [1 log: 1 + 1 logs 1 + 2 loge 2 4 --- + 9 loge 9 +3 logs 8), 


$$s_{jm} = 372.38710/125,$$
$$s_{jm} = 2.97910.$$


We obtain the value for $s_i$ from the $n_i$ in the bottom marginal of Table 1:

$$s_i = \frac{1}{125} [31 \log_2 31 + 30 \log_2 30 + 33 \log_2 33 + 31 \log_2 31],$$
$$s_i = 620.83188/125,$$
$$s_i = 4.96665.$$

The computation for s is based on the total number of measurements:

$$s = \log_2 125 = 6.96579.$$


It is evident that these calculations are performed very easily with
a table of $n \log_2 n$. If he wishes, the reader may also make the computations
with tables of p log p like those prepared by Newman (8), and Dolansky (3).
The use of $p \log_2 p$ tables for analyzing discrete data is not recommended,
however, because it leads to rounding errors that the table of n log n avoids.
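
A table lookup is no longer needed, but the same calculation is easily sketched in code. The example below uses only the four stimulus marginals quoted in the calculation of $s_i$ above; the function name `n_log2_n` is our own.

```python
from math import log2

def n_log2_n(n):
    """One entry of an n log2 n table (0 log 0 is taken as 0)."""
    return n * log2(n) if n > 0 else 0.0

# Stimulus marginals n_i, as they appear in the calculation of s_i in the text.
stimulus_marginals = [31, 30, 33, 31]
n = sum(stimulus_marginals)          # 125 observations

s_i = sum(n_log2_n(c) for c in stimulus_marginals) / n
s = log2(n)
print(s_i, s)
```

Working from integer counts keeps every intermediate value exact up to the final division, which is the point made in the text about avoiding p log p tables.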


The complete set of s-terms in the experiment on frequency judgment worked
out as follows:

$s_{ijm}$ = 1.45211      $s_i$ = 4.96665
$s_{ij}$ = 2.91389       $s_j$ = 4.79269
$s_{im}$ = 2.99246       $s_m$ = 4.93380
$s_{jm}$ = 2.97910       $s$ = 6.96579


In section 4 it was shown that response information, H'(y), can be
analyzed into components

$$H'(y) = H'_{uv}(y) + T'(u;y) + T'(v;y) + A'(uvy). \tag{17}$$
Since $H'(y) = s - s_m$, we see that H'(y) = 2.03199 bits. If the subject
had used the four response keys equally often, this figure would have been
at most 2 bits. The extra information shows that the subject sometimes did
not respond. This can be verified from the right-hand marginals in Tables 1
and 2. The rest of the quantities in equation (17) are easily computed from
s-terms. For example, $H'_{uv}(y)$ is computed from $s_{ij} - s_{ijm}$. We see that
$H'_{uv}(y)$ is 1.46178 bits. This is the part of the response information that
is not accounted for either by the auditory stimuli or the presponses. Con- 
sequently, 1.46178/2.03199 or 72 per cent of the response information is 
unanalyzed error. Some 28 per cent of the response information must therefore 
be due to associations between the subject's responses and the two predicting 
variables. 

If we consider the association between auditory stimuli (u) and responses 
(y), we have 

$$T'(u;y) = s - s_i - s_m + s_{im},$$
$$T'(u;y) = .05780.$$

Thus only .058 bits are transmitted from the frequency stimuli, accounting for 
less than 3 per cent of the response information. This is not surprising because 
the signal-to-noise ratio was set near the masked threshold and the stimuli 
were difficult to hear. 

If we consider the association between presponses (v) and current
responses (y), we find a little more transmitted information:

$$T'(v;y) = s - s_j - s_m + s_{jm},$$
$$T'(v;y) = .21840.$$

This value of .218 bits transmitted amounts to some 11 per cent of the
response information.

The last element in equation (17) is the stimulus X response X presponse
interaction, A'(uvy). This is computed from

$$A'(uvy) = -s + s_i + s_j + s_m - s_{ij} - s_{im} - s_{jm} + s_{ijm},$$
$$A'(uvy) = .29401.$$

We see that about 14 per cent of the response information is due to the
interaction. Knowledge of the interaction also permits us to hold one of the
inputs constant while measuring transmission from the other input. For
example, the transmission from stimuli to responses with presponses held
constant is:

$$T'_v(u;y) = s_j - s_{ij} - s_{jm} + s_{ijm}$$
$$= T'(u;y) + A'(uvy)$$
$$= .35181.$$
Our calculations for the parts of the response information that we
can analyze with the three-dimensional model lead to weights of approxi-
mately 3, 11 and 14 per cent for stimuli, presponses and interaction respec- 
tively. These figures sum to 28 per cent, the amount of transmitted informa- 
tion we predicted from the size of the noise term. We can also obtain this total 


weight directly by computing the information transmitted from both inputs 
together. We have 


$$T'(u,v;y) = s - s_m - s_{ij} + s_{ijm},$$
$$T'(u,v;y) = .57021.$$


If we now divide this three-dimensional transmitted information by the 
response information, we get back our figure of 28 per cent. 
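
The arithmetic of this section can be retraced from the eight s-terms reported for the experiment. The sketch below takes those reported values as given and recomputes each component of equation (17) together with its share of the response information; the variable names are our own.

```python
# s-terms as reported in the text for the frequency-judgment experiment
s, s_i, s_j, s_m = 6.96579, 4.96665, 4.79269, 4.93380
s_ij, s_im, s_jm, s_ijm = 2.91389, 2.99246, 2.97910, 1.45211

h_y = s - s_m                        # response information H'(y)
h_uv = s_ij - s_ijm                  # error term H'_uv(y)
t_uy = s - s_i - s_m + s_im          # stimuli to responses, equation (6)
t_vy = s - s_j - s_m + s_jm          # presponses to responses, equation (9)
a_uvy = -s + s_i + s_j + s_m - s_ij - s_im - s_jm + s_ijm   # equation (13)
t_uv_y = s - s_m - s_ij + s_ijm      # both inputs together, equation (5)

rows = [("H'_uv(y)", h_uv), ("T'(u;y)", t_uy), ("T'(v;y)", t_vy),
        ("A'(uvy)", a_uvy), ("T'(u,v;y)", t_uv_y)]
for name, value in rows:
    print(f"{name:10s} {value:8.5f} bits  {100 * value / h_y:5.1f}% of H'(y)")
```

The error term and the three association terms add up to the response information, and the three association terms add up to the three-dimensional transmission.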

There are several points worth noting about our application of informa- 
tion theory to this experiment. The first is that the analysis is additive. 
The component measures of association plus the measure of error (or noise) 
sum to the response information. Furthermore, the analysis is exact. No
approximations are involved. The process is very similar to the partition
of a sum of squares in analysis of variance. As a matter of fact, a notation
can be worked out in analysis of variance that is exactly parallel to the
s-notation in multivariate information transmission (4).


The second point is that information transmission is made to order 
for contingency tables. Measures of transmitted information are zero when 
variables are independent in the contingency-sense (as opposed to the restric- 
tion to linear independence in analysis of variance). In addition, the analysis 
is designed for frequency data in discrete categories, while methods based on 
analysis of variance are not. No assumptions about linearity are introduced 
in multivariate information transmission. Furthermore, when statistical 
tests are developed in a later section, it will be shown that these tests are
distribution-free in the sense that they are extensions of the familiar
chi-square test of independence.

The measure of amount of information transmitted also has certain 
inherent advantages. Garner and Hake (2) and Miller (5) have pointed out 
that the amount of information transmitted is approximately the logarithm
of the number of perfectly discriminated input-classes. In experiments on
discrimination like the one we have discussed, the measure provides an
immediate picture of the subject’s discriminative ability. Miller has also
discussed applications of this property in mental testing and in the general
theory of measurement.
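
Read in the other direction, the logarithmic relation gives an equivalent number of perfectly discriminated classes for any amount of transmission in bits. A minimal sketch, assuming base-2 logarithms as in the rest of the paper and using the transmission values worked out in the example:

```python
def equivalent_classes(bits):
    """Number of perfectly discriminated input classes equivalent to `bits`."""
    return 2 ** bits

cases = [
    ("stimuli alone, T'(u;y)", 0.05780),
    ("stimuli and presponses, T'(u,v;y)", 0.57021),
    ("perfect identification of four tones", 2.0),
]
for label, bits in cases:
    print(label, equivalent_classes(bits))
```

Near the masked threshold the stimuli alone are equivalent to barely more than one class, which matches the remark that the tones were difficult to hear.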


6. Independence in Three-Dimensional Transmission 


It is evident from the definition of transmitted information that 
T'(u,v;y) = 0 when the output is independent of the joint input, i.e., when

$$n_{ijm} = \frac{n_{ij} n_m}{n}. \tag{20}$$

With this kind of independence, we can show that

$$s_{ijm} = s_{ij} + s_m - s.$$

This expression for $s_{ijm}$ may be substituted into (5) to confirm the fact that
T'(u,v;y) = 0.

Now suppose that T'(u,v;y) > 0 but that v and y are independent,
that is to say,

$$n_{jm} = \frac{n_j n_m}{n}. \tag{21}$$

This leads to

$$s_{jm} = s_j + s_m - s.$$

If we substitute for $s_{jm}$ in equation (9), we find that T'(v;y) = 0. Equation
(21) does not provide a unique condition for independence between v and y.
To show this, let us pick some value of u and study the v-to-y transmission
at that value of u. We now require that

$$n_{ijm} = \frac{n_{ij} n_{im}}{n_i}. \tag{22}$$

If we have (22) for all i, we must have

$$s_{ijm} = s_{ij} + s_{im} - s_i,$$

and it follows from substitution in (10) that $T'_u(v;y) = 0$. This is the situation
in which v and y are independent provided that u is held constant. It is an
interesting case because we can show from (14) that if this kind of independence
happens,

$$A'(uvy) = -T'(v;y).$$

The sign of T'(v;y) must be positive or zero so that -T'(v;y) must be negative
or zero. Consequently, A'(uvy) can be negative. We see that negative interaction
information is produced when the information transmitted between a
pair of variables is due to a regression on a third variable. Holding the
interacting variable constant causes the transmitted information to disappear.
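
A small constructed table makes the negative case concrete. In the sketch below the counts are invented so that condition (22) holds exactly: within each class of u, v and y are independent, but both tend to follow u. The helper `s_term` is a hypothetical name for the s-term of a margin.

```python
from collections import Counter
from math import log2

def s_term(table, axes):
    """s-term for the margin over `axes`; axes = () gives log2 n."""
    n = sum(table.values())
    margin = Counter()
    for cell, count in table.items():
        margin[tuple(cell[a] for a in axes)] += count
    return sum(c * log2(c) for c in margin.values() if c > 0) / n

# Cells are (i, j, m) = (u, v, y). Within each class of u the counts satisfy
# n_ijm = n_ij * n_im / n_i, e.g. 64 = 80 * 80 / 100 for the cell (0, 0, 0).
table = Counter({
    (0, 0, 0): 64, (0, 0, 1): 16, (0, 1, 0): 16, (0, 1, 1): 4,
    (1, 0, 0): 4,  (1, 0, 1): 16, (1, 1, 0): 16, (1, 1, 1): 64,
})
U, V, Y = 0, 1, 2
s = s_term(table, ())

# Equation (9): transmission between v and y with u ignored.
t_vy = s - s_term(table, (V,)) - s_term(table, (Y,)) + s_term(table, (V, Y))
# Equation (10): transmission between v and y with u held constant.
t_vy_given_u = (s_term(table, (U,)) - s_term(table, (U, V))
                - s_term(table, (U, Y)) + s_term(table, (U, V, Y)))
# Equation (14): interaction as the difference of the two.
interaction = t_vy_given_u - t_vy
print(t_vy, t_vy_given_u, interaction)
```

Here the whole of the apparent association between v and y is carried by u, so holding u constant removes it and the interaction equals the negative of T'(v;y).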

If we have the independence defined by (21), we may not necessarily 
have the independence defined by (22). Let us suppose that we have both, i.e.,
that we have

$$s_{jm} = s_j + s_m - s,$$
$$s_{ijm} = s_{ij} + s_{im} - s_i.$$

Now we substitute for $s_{jm}$ and $s_{ijm}$ in equation (8).

$$T'_v(u;y) = s_j - s_{ij} - s_{jm} + s_{ijm},$$
$$T'_v(u;y) = s_j - s_{ij} - (s_j + s_m - s) + (s_{ij} + s_{im} - s_i),$$
$$T'_v(u;y) = s - s_i - s_m + s_{im},$$
$$T'_v(u;y) = T'(u;y).$$

Both kinds of independence, (21) and (22), together mean that v is not
involved in transmission between u and y. When this happens we do not
have three-dimensional transmission, since u is the only input variable
(provided that no information is transmitted between u and v). As might
be expected, both kinds of independence can be generated from a single
restriction on the data, namely

$$n_{ijm} = \frac{n_{im}}{V},$$

where V is the number of classes in v.
We have studied the case where v is independent of y. We could have 


had u independent of y, or u independent of v. The results are analogous to
those we have presented. 


7. Correlated Sources of Information


Three-dimensional transmitted information, T’(uv;y), accounts for 
only part of the total amount of association in a three-dimensional contingency 
table. It does not exhaust all the association in the table because it neglects 
the association between the inputs. When this association is considered, i.e., 
when all the relations in the contingency table are represented, we are led to
an equation that is very useful for generating the components of multivariate
transmission. Consider

$$C'(u,v,y) = H'(u) + H'(v) + H'(y) - H'(u,v,y). \tag{23}$$

If we add and subtract H'(u,v), we obtain

$$C'(u,v,y) = T'(u;v) + T'(u,v;y),$$
$$C'(u,v,y) = T'(u;v) + T'(u;y) + T'(v;y) + A'(uvy). \tag{24}$$

We see that C'(u,v,y) generates all possible components of the three correlated
information-sources, u, v, and y.
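
The step of adding and subtracting H'(u,v) can be followed numerically. The sketch below assumes simulated observations in which the two inputs are deliberately correlated (the generating rule is our own invention) and compares C'(u,v,y) from (23) with the sum of the input correlation and the three-dimensional transmission.

```python
from collections import Counter
from math import log2
import random

def H(samples, axes):
    """Sample entropy in bits of the joint variable formed by `axes`."""
    n = len(samples)
    counts = Counter(tuple(row[a] for a in axes) for row in samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

random.seed(2)
samples = []
for _ in range(400):
    u = random.randrange(3)
    # Assumed rule: v copies u half the time, so the inputs are correlated.
    v = u if random.random() < 0.5 else random.randrange(3)
    y = (u + v) % 3 if random.random() < 0.7 else random.randrange(3)
    samples.append((u, v, y))

U, V, Y = 0, 1, 2
# Equation (23): total association in the three-dimensional table.
C = H(samples, (U,)) + H(samples, (V,)) + H(samples, (Y,)) - H(samples, (U, V, Y))
# Association between the inputs, T'(u;v).
T_uv = H(samples, (U,)) + H(samples, (V,)) - H(samples, (U, V))
# Equation (4): transmission from the joint input to y.
T_uv_y = H(samples, (U, V)) + H(samples, (Y,)) - H(samples, (U, V, Y))
print(C, T_uv + T_uv_y)
```

The term T'(u;v) is the part of the total association that three-dimensional transmission leaves out.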

8. Four-Dimensional Transmitted Information


It will be instructive to extend our measures one step further, i.e., to 
transmitted information with three input variables, since from that point 
results can be generalized easily to an N-dimensional input. For simplicity 
we shall restrict our development to the case of a channel with a multivariate
input and a univariate output. The more general case with N inputs and
M outputs does not present any special problems, and can be constructed
with no difficulty once the rules become clear.

Let us add a new variable w to the bivariate input, u,v. The joint input
is now u,v,w. We suppose that w sends signals h = 1, 2, 3, ..., W. This gives
us four sources of information u, v, w, and y. We can proceed to define a
four-way interaction information, A'(uvwy), as follows:

$$A'(uvwy) = A'_w(uvy) - A'(uvy).$$

We have already defined A'(uvy). The definition of $A'_w(uvy)$ will be similar
except that the subscript w indicates that A'(uvy) is to be averaged over w.
As we have already noted, this is accomplished by adding the subscript h to
each of the s-terms that make up A'(uvy). Consequently

$$A'_w(uvy) = -s_h + s_{hi} + s_{hj} + s_{hm} - s_{hij} - s_{him} - s_{hjm} + s_{hijm}. \tag{25}$$


It is readily shown that A'(uvwy) is symmetrical in the sense that it does not
matter which variable is chosen for averaging, i.e.,

$$A'(uvwy) = A'_u(vwy) - A'(vwy)$$
$$= A'_v(uwy) - A'(uwy) \tag{26}$$
$$= A'_w(uvy) - A'(uvy)$$
$$= A'_y(uvw) - A'(uvw).$$

We see that A'(uvwy) is the amount of information gained (or lost) in
transmission by controlling a fourth variable when any three of the variables are
already known.

If we examine all possible associations in a four-dimensional contingency
table, we obtain

$$\begin{aligned}
C'(u,v,w,y) ={}& T'(u;v) + T'(u;w) + T'(u;y) + T'(v;w) + T'(v;y) + T'(w;y) \\
&+ A'(uvw) + A'(uvy) + A'(uwy) + A'(vwy) + A'(uvwy),
\end{aligned} \tag{27}$$

where

$$C'(u,v,w,y) = H'(u) + H'(v) + H'(w) + H'(y) - H'(u,v,w,y).$$

Equation (27) can be proved by expanding both sides in s-notation.
It turns out that in the general case, C'(u,v,w, ..., y) is expanded by writing
down T-terms for all possible pairs of variables, and A-terms for all possible
combinations of three, four variables and so on.
Four-dimensional transmitted information from u,v,w to y, i.e.,
T'(u,v,w;y), can be written as follows:

$$T'(u,v,w;y) = H'(y) + H'(u,v,w) - H'(u,v,w,y). \tag{28}$$

The same arguments are used to justify (28) as were used in the case of (4)
in three-dimensional transmission. To find the components of T'(u,v,w;y),
we note that

$$T'(u,v,w;y) = C'(u,v,w,y) - C'(u,v,w). \tag{29}$$

This means that T'(u,v,w;y) contains all the components of C'(u,v,w,y) except
the correlations among the inputs. Consequently the components of
T'(u,v,w;y) are

$$\begin{aligned}
T'(u,v,w;y) ={}& T'(u;y) + T'(v;y) + T'(w;y) \\
&+ A'(uvy) + A'(uwy) + A'(vwy) + A'(uvwy).
\end{aligned} \tag{30}$$
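The four-variable expansions can be verified on a small table. The following is a sketch under stated assumptions: the 2 x 2 x 2 x 2 counts are hypothetical, and `component` computes each T- or A-term as a signed inclusion-exclusion sum of marginal entropies, which is a form chosen here for compactness rather than the one used in the text. The last check compares that form against the recursive definition (averaging over w, then subtracting the three-way term).

```python
from collections import defaultdict
from itertools import combinations, product
from math import log2

def entropy(counts, axes):
    """Sample entropy (bits) of the marginal table over the given axes."""
    n = sum(counts.values())
    marg = defaultdict(int)
    for cell, k in counts.items():
        marg[tuple(cell[a] for a in axes)] += k
    return -sum(k / n * log2(k / n) for k in marg.values() if k > 0)

def component(counts, axes):
    """T-term (two axes) or A-term (three or more axes), written as a signed
    inclusion-exclusion sum over entropies of all non-empty subsets of axes."""
    sign = (-1) ** (len(axes) + 1)
    total = sum((-1) ** r * entropy(counts, sub)
                for r in range(1, len(axes) + 1)
                for sub in combinations(axes, r))
    return sign * total

def averaged_over(counts, axes, cond):
    """Component for `axes`, computed within each level of axis `cond` and
    averaged with weights n_h / n."""
    n = sum(counts.values())
    levels = defaultdict(dict)
    for cell, k in counts.items():
        levels[cell[cond]][cell] = k
    return sum(sum(sub.values()) / n * component(sub, axes)
               for sub in levels.values())

# Hypothetical counts; axes 0, 1, 2, 3 stand for u, v, w, y.
counts = {cell: 1 + (3 * cell[0] + 5 * cell[1] + 7 * cell[2] + 11 * cell[3]
                     + 4 * cell[1] * cell[2] * cell[3]) % 13
          for cell in product(range(2), repeat=4)}
u, v, w, y = 0, 1, 2, 3
all_axes = (u, v, w, y)

C4 = sum(entropy(counts, (a,)) for a in all_axes) - entropy(counts, all_axes)
C3 = sum(entropy(counts, (a,)) for a in (u, v, w)) - entropy(counts, (u, v, w))
T4 = entropy(counts, (y,)) + entropy(counts, (u, v, w)) - entropy(counts, all_axes)

# Right side of (27): every pair, triple, and the quadruple.
rhs27 = sum(component(counts, sub)
            for r in (2, 3, 4) for sub in combinations(all_axes, r))
# Right side of (30): only the components that involve y.
rhs30 = sum(component(counts, sub)
            for sub in [(u, y), (v, y), (w, y),
                        (u, v, y), (u, w, y), (v, w, y), all_axes])

A4 = component(counts, all_axes)
A4_recursive = averaged_over(counts, (u, v, y), w) - component(counts, (u, v, y))
print(round(C4, 6), round(rhs27, 6), round(T4, 6), round(rhs30, 6))
```

Because every quantity is built from the same marginal entropies, the agreement is exact up to floating-point rounding for any table with positive counts.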


The components of T'(u,v,w;y) are shown in schematic form in Figure 2.

FIGURE 2

Schematic diagram of the components of four-dimensional transmitted
information, with three transmitters and a single receiver.


If it happens that

$$n_{hijm} = n_{ijm}/W,$$

where W is the number of classes in w, all the components of C'(u,v,w,y) that
are functions of w drop out and C'(u,v,w,y) = C'(u,v,y). In similar fashion,
C'(u,v,y) can be reduced to C'(u,y). This is precisely what we did in the
analysis of independence in three-dimensional transmitted information. Since
C'(u,y) = T'(u;y), we see that all cases of transmission with multivariate
inputs can be related to the bivariate case.


With three inputs controlled, we are ready to extend the analysis of
response information in section 4 a step further. We have

$$H'(y) = H'_{uvw}(y) + T'(u,v,w;y). \tag{31}$$

Equation (31) says that we can measure the effects in response information
due to the three inputs. This is evident from the fact that (30) tells us how
to expand T'(u,v,w;y) in its components. In addition we know that

$$H'_{uv}(y) = H'_{uvw}(y) + T'_{uv}(w;y), \tag{32}$$

where

$$T'_{uv}(w;y) = T'(w;y) + A'(uwy) + A'(vwy) + A'(uvwy). \tag{33}$$

We see that controlling w in addition to u and v enables us to rescue the
information transmitted between w and y from the noise, and to replace
$H'_{uv}(y)$ with a better estimate of noise information, namely $H'_{uvw}(y)$.

The transition to an N-dimensional input is now evident. In general,
we have

$$H'(y) = H'_{uvw \cdots z}(y) + T'(u,v,w,\ldots,z;y). \tag{34}$$

The (N + 1)-dimensional transmitted information, T'(u,v,w, ..., z;y), can
then be expanded in its components in the manner that we have described.

9. Asymptotic Distributions

Miller and Madow (6) have shown that sample information is related
to the likelihood ratio. Following Miller and Madow, we can show that the
large sample distribution of the likelihood ratio may be used to find
approximate distributions for the quantities involved in multivariate transmission.
Consider, for example, three-dimensional sample transmitted information,
T'(u,v;y). We can test the hypothesis that T(u,v;y) is equal to zero.
This is equivalent to the hypothesis that

$$p(i,j,m) = p(i,j) \cdot p(m), \tag{35}$$

since T(u,v;y) is zero when input and output are independent. This hypothesis
leads to the likelihood ratio [see reference (7)],

$$\lambda = \frac{\prod_{i,j} n_{ij}^{\,n_{ij}} \prod_{m} n_{m}^{\,n_{m}}}{n^{n} \prod_{i,j,m} n_{ijm}^{\,n_{ijm}}}. \tag{36}$$

If we take logs, we obtain

$$\begin{aligned}
\frac{-2\log_e \lambda}{1.3863\,n} &= s - s_{ij} - s_m + s_{ijm}, \\
-2\log_e \lambda &= 1.3863\,n\,T'(u,v;y).
\end{aligned} \tag{37}$$

For large samples, $-2\log_e \lambda$ has approximately a $\chi^2$ distribution with
(UV - 1)(Y - 1) degrees of freedom when the null hypothesis (35) is true.
Thus 1.3863 nT'(u,v;y) is distributed approximately like $\chi^2$ if T(u,v;y) is
equal to zero.
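The link between the likelihood ratio and transmitted information can be checked directly. In this sketch the 2 x 3 x 2 table is hypothetical and the function names are introduced here; the statistic is computed once from the cell counts and once as 2 ln 2 times n times the sample transmitted information in bits, where 2 ln 2 is the constant 1.3863.

```python
from collections import defaultdict
from math import log, log2

def marginal(counts, axes):
    """Marginal table of counts over the given axes."""
    marg = defaultdict(int)
    for cell, k in counts.items():
        marg[tuple(cell[a] for a in axes)] += k
    return marg

def xlogx_sum(table):
    """Sum of k * ln(k) over the cells of a table (empty cells contribute 0)."""
    return sum(k * log(k) for k in table.values() if k > 0)

def neg2_log_lambda(counts):
    """-2 ln(lambda) for the hypothesis p(i,j,m) = p(i,j) p(m)."""
    n = sum(counts.values())
    log_lam = (xlogx_sum(marginal(counts, (0, 1)))
               + xlogx_sum(marginal(counts, (2,)))
               - n * log(n)
               - xlogx_sum(counts))
    return -2.0 * log_lam

def entropy_bits(counts, axes):
    n = sum(counts.values())
    return -sum(k / n * log2(k / n)
                for k in marginal(counts, axes).values() if k > 0)

def transmitted_uv_y(counts):
    """T'(u,v;y) = H'(u,v) + H'(y) - H'(u,v,y), in bits."""
    return (entropy_bits(counts, (0, 1)) + entropy_bits(counts, (2,))
            - entropy_bits(counts, (0, 1, 2)))

# Hypothetical counts n_ijm with U = 2, V = 3, Y = 2.
counts = {
    (0, 0, 0): 12, (0, 0, 1): 5, (0, 1, 0): 7, (0, 1, 1): 9,
    (0, 2, 0): 4,  (0, 2, 1): 11, (1, 0, 0): 6, (1, 0, 1): 10,
    (1, 1, 0): 8,  (1, 1, 1): 3, (1, 2, 0): 9, (1, 2, 1): 6,
}
n = sum(counts.values())
U, V, Y = 2, 3, 2

stat_direct = neg2_log_lambda(counts)
stat_from_T = 2 * log(2) * n * transmitted_uv_y(counts)
df = (U * V - 1) * (Y - 1)
print(round(stat_direct, 6), round(stat_from_T, 6), df)
```

The resulting statistic would be referred to a chi-square table with `df` degrees of freedom; no significance level is computed here.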


A more important problem involves testing suspected information
sources. Suppose in our three-dimensional example, we assume that

$$p(i,j,m) = p(i) \cdot p(j) \cdot p(m). \tag{38}$$

This hypothesis leads to the likelihood ratio for complete independence in a
three-dimensional contingency table,

$$\lambda = \frac{\prod_{i} n_{i}^{\,n_{i}} \prod_{j} n_{j}^{\,n_{j}} \prod_{m} n_{m}^{\,n_{m}}}{n^{2n} \prod_{i,j,m} n_{ijm}^{\,n_{ijm}}}. \tag{39}$$

After we take logs we find that

$$\begin{aligned}
\frac{-2\log_e \lambda}{1.3863\,n} &= 2s - s_i - s_j - s_m + s_{ijm} \\
&= H'(u) + H'(v) + H'(y) - H'(u,v,y), \\
-2\log_e \lambda &= 1.3863\,n\,C'(u,v,y).
\end{aligned} \tag{40}$$

For large samples $-2\log_e \lambda$ has approximately a $\chi^2$ distribution with
(UVY - 1) - (U - 1) - (V - 1) - (Y - 1) degrees of freedom when the
null hypothesis is true.


We also know that

$$C'(u,v,y) = T'(u;y) + T'(v;y) + T'_y(u;v). \tag{41}$$

The likelihood ratio can be used to show that 1.3863 nT'(u;y) and
1.3863 nT'(v;y) are asymptotically distributed like $\chi^2$ with (U - 1)(Y - 1)
and (V - 1)(Y - 1) degrees of freedom, respectively, if T(u;y) and T(v;y)
are zero. To find the asymptotic distribution of $T'_y(u;v)$, we make the following
hypothesis:

$$p(i,j,m) = p(i,m)\,p_m(j), \tag{42}$$

where $p_m(j)$ is the conditional probability of j given m.
Now we have the ratio

$$\lambda = \frac{\prod_{i,m} n_{im}^{\,n_{im}} \prod_{j,m} n_{jm}^{\,n_{jm}}}{\prod_{m} n_{m}^{\,n_{m}} \prod_{i,j,m} n_{ijm}^{\,n_{ijm}}}, \tag{43}$$

$$\begin{aligned}
\frac{-2\log_e \lambda}{1.3863\,n} &= s_m - s_{im} - s_{jm} + s_{ijm}, \\
-2\log_e \lambda &= 1.3863\,n\,T'_y(u;v).
\end{aligned} \tag{44}$$

In this case $-2\log_e \lambda$ has Y(U - 1)(V - 1) degrees of freedom. In view
of (41) we can write

$$1.3863\,n\,C'(u,v,y) = 1.3863\,n\,[T'(u;y) + T'(v;y) + T'_y(u;v)]. \tag{45}$$

The quantities on the right side of (45) have degrees of freedom that sum to
(UVY - U - V - Y + 2). Since this is the same number of degrees of
freedom as on the left hand side of (45), the quantities on the right side of
(45) are asymptotically independent, if the null hypothesis,

$$p(i,j,m) = p(i) \cdot p(j) \cdot p(m),$$

is true.
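The bookkeeping of degrees of freedom is easy to confirm by direct arithmetic. In the sketch below the function names are introduced here, and the sizes U = 4, V = 5, Y = 5 are an inference, not a value stated in this section: they are the class counts that reproduce the 12, 16, and 60 degrees of freedom listed in Table 3.

```python
def df_components(U, V, Y):
    """Degrees of freedom of the three terms on the right side of (45)."""
    return (U - 1) * (Y - 1), (V - 1) * (Y - 1), Y * (U - 1) * (V - 1)

def df_total(U, V, Y):
    """Degrees of freedom for complete independence in a U x V x Y table."""
    return U * V * Y - U - V - Y + 2

# The component degrees of freedom add up to the total for every table size tried.
always_additive = all(
    sum(df_components(U, V, Y)) == df_total(U, V, Y)
    for U in range(2, 9) for V in range(2, 9) for Y in range(2, 9)
)

example = df_components(4, 5, 5)   # inferred sizes, see the note above
print(always_additive, example, df_total(4, 5, 5))
```

With the inferred sizes the components come to 12, 16, and 60, and the total to 88.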
This means that as an approximation we can test T'(u;y), T'(v;y) and
$T'_y(u;v)$ simultaneously for significance under the null hypothesis we have
stated. The test is very similar to an analysis of variance. We can see the
similarity by applying the test to the data from our example in section 5.
The significance tests will be made on the quantities in equation (45). To do
this we need to compute C'(u,v,y) and $T'_y(u;v)$, since these terms were not
discussed in section 5. First we note that C'(u,v,y) is the total amount of
association in the stimulus × response × presponse table. We have

$$\begin{aligned}
C'(u,v,y) &= 2s + s_{ijm} - s_i - s_j - s_m, \\
C'(u,v,y) &= .69055.
\end{aligned}$$

We also need $T'_y(u;v)$, the information transmitted from presponses to stimuli
with responses held constant. This measures how successfully the presponses
predict the auditory stimuli. Since stimuli were chosen at random, we do not
expect much transmitted information here. The computation goes as follows:

$$\begin{aligned}
T'_y(u;v) &= s_m - s_{im} - s_{jm} + s_{ijm} \\
&= T'(u;v) + A'(uvy) \\
&= .41435.
\end{aligned}$$

We may now put our computed values for C'(u,v,y), T'(u;y), T'(v;y) and
$T'_y(u;v)$ into equation (45) and perform the $\chi^2$ tests. The results are summarized
in Table 3. We have not attempted to calculate the significance level of
C'(u,v,y) because we do not have enough data to sustain the 88 degrees of
freedom. The same criticism can probably be leveled at our test for $T'_y(u;v)$.
In any case Table 3 shows that the only significant effect in the experiment
is the presponse-response association.
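The entries of Table 3 follow from the bit values by multiplying by 1.3863 n. The sample size is not given in this section; the sketch below backs it out from the reported numbers, so n = 125 is an inference made here and should be checked against section 5.

```python
from math import log

K = 1.3863                 # the constant used in the text, approximately 2 ln 2

# Values reported in the text and in Table 3.
C_bits, C_stat = 0.69055, 119.664        # total association
T_bits, T_stat = 0.41435, 71.802         # presponse-stimulus, responses held constant

# Sample size implied by each pair (an inference, not a stated value).
n_from_C = C_stat / (K * C_bits)
n_from_T = T_stat / (K * T_bits)
n = round(n_from_T)

# Statistics recomputed with the inferred n.
C_recomputed = K * n * C_bits
T_recomputed = K * n * T_bits
print(round(n_from_C, 3), round(n_from_T, 3), n)
print(round(C_recomputed, 3), round(T_recomputed, 3))
```

Both pairs imply essentially the same sample size, which suggests the tabled statistics and the bit values are mutually consistent.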

One interesting fact that the analysis brings out clearly is that we
cannot decide whether an amount of transmitted information is big or small
without knowing its degrees of freedom. In our example we find that
$T'_y(u;v)$ = .414 bits, while T'(v;y) = .218 bits. Yet T'(v;y) is significant
and $T'_y(u;v)$ is not. The reason lies in the difference in degrees of freedom.
Miller and Madow (6) have discussed the amount of statistical bias in
information measures due to degrees of freedom, and have suggested corrections.

TABLE 3

Table of Transmitted Information

Transmission          Component     -2 log_e λ     df      P
Stimulus-Response     T'(u;y)         10.016       12    >.50
Presponse-Response    T'(v;y)         37.844       16    <.01
Presponse-Stimulus    T'_y(u;v)       71.802       60    =.10
Total                 C'(u,v,y)      119.664       88

In Table 3, we tested $T'_y(u;v)$, the association between presponses and
stimuli with responses held constant. This association is broken down still

TABLE 4

Table of Transmitted Information

Transmission          Component     -2 log_e λ     df      P
Presponse-Stimulus    T'(u;v)         20.853       12    >.05
Interaction           A'(uvy)         50.948       [illegible]
Total                 T'_y(u;v)       71.802       60    =.10

Probability not estimated.
Asymptotic distribution is not chi-square. All A-terms are distributed li[...]
variables each of which [...]

RANDOM FLUCTUATIONS OF RESPONSE RATE*


WILLIAM J. McGILL
COLUMBIA UNIVERSITY


A simple model for fluctuating interres[...]
studied. It involves a mechanism that generates rel[...]
each of which can trigger off a response after a random delay. The excitations
are not observable, but their periodicity is reflected in a regular patterning
of responses. The probability distribution of i[...]
derived and its properties are analyzed. S[...]
examined.

*This paper [...] completed while the writer was a visiting summer scientist at
the Lincoln Laboratory, [...]

This article appeared in Psychometrika, 1962, 27, 3-17. Reprinted with permission.

A number of behavioral systems [...]
responses that recur regularly in time. [...] paper
is an attempt to examine the properties of an elementary mechanism for
producing noisy fluctuations in otherwise constant time
intervals. Despite its simplicity, the mechanism can duplicate a variety of
observed phenomena, ranging from sharply peaked and symmetrical
distributions of interresponse times to highly skewed distributions, and even
completely random responding. Moreover, all these behaviors can be elicited
from the same mechanism by altering the rate at which it is excited.


Periodic Excitation 


We begin by examining interresponse times that are nearly constant. 
The key to this regularity, we assume, is some sort of periodic excitatory 
process that triggers a response after a short random delay. Even when the 
excitations are not observable their effects are seen in the regular intervals 
they impose between responses. The periodic mechanism proposed here is 
diagrammed in Fig. 1, which also illustrates our notation.

FIGURE 1

Stochastic latency mechanism yielding variable interresponse times with a
periodic component. Excitations (not observable) come at regular intervals $\tau$,
but are subject to random delays before producing responses. Heavy line is
the time axis.


E and R denote excitation and response respectively. The time interval
between two successive responses is a random variable and is called t. The
analogous interval (or period) between excitations is a fixed (unknown)
constant $\tau$. Excitation and response almost never coincide in time.
Consequently a response will almost always be located between two excitations,
and its distance from each excitation can be expressed as two location
coordinates. The first of these, r, is the delay from a response to the next
following excitation. The second, s, is the corresponding interval between a
response and the excitation that immediately precedes it.

The basic random quantity in Fig. 1 is s, and our problem is to deduce
the distribution of t when the distribution of s is known. Accordingly, suppose
that s has an exponential distribution as would be the case if interresponse
times were completely random. Let

$$f(s) = \lambda e^{-\lambda s}, \tag{1}$$

where f(s) is the frequency function of s, and $\lambda$ is a positive constant, i.e.,
the time constant. Equation (1) then describes a very simple delay process
in which the probability of a response during any short interval of length $\Delta s$
following excitation is constant and equal to $\lambda\,\Delta s$ (see Feller [4], p. 220).
This defines what we mean by "completely" random; the instantaneous [...]
compared with the period between [...]
is practically certain to occur before $E_2$ comes along, and the tail of the
distribution of s never really gets tangled with the next following excitation.
When it happens that $\lambda\tau$ is not large, a simple adjustment of (1) is required
in order to bound s between zero and $\tau$, without changing its characterization
as a completely random interval.


Distribution of Interresponse Times With a Periodic Component


Our main results are given in (2) and (3), which describe the probability
distribution of the mechanism outlined in the first section and pictured in
Fig. 1. The density function describing the distribution of the time interval t
between two successive responses is

$$f(t) = \begin{cases}
\dfrac{\lambda\nu}{1-\nu}\,\sinh \lambda t, & 0 \le t \le \tau, \\[6pt]
\dfrac{1+\nu}{2\nu}\,\lambda e^{-\lambda t}, & t \ge \tau,
\end{cases} \tag{2}$$

in which $\nu$ is a constant given by $\nu = e^{-\lambda\tau}$. The distribution is evidently
skewed and has a well-defined maximum over t = $\tau$.

Whenever $\lambda\tau$ happens to be large enough so that $\nu$ is negligibly small,
the distribution of interresponse time in (2) simplifies to

$$f(t) = \frac{\lambda}{2}\,e^{-\lambda|t-\tau|}, \qquad 0 \le t \le \infty. \tag{3}$$

Equation (3) is the well-known Laplace density function [1]. It is symmetrical
and sharply peaked over t = $\tau$, and describes the behavior of the latency
mechanism when the intervals between successive responses are dominated
by the periodic component $\tau$. "Noise" introduced by the random component
must then be small in comparison with the periodicity generated by the
excitatory process.

The approximation in (3) is easily rationalized if $1/\lambda$ is considered as
measuring the magnitude of the random component. In that case $1/\lambda\tau$
measures the size of the noisy perturbation relative to the period between
excitations. Hence the parameter $\nu$ will go toward zero whenever the ratio
$1/\lambda\tau$ gets small, i.e., whenever the random component is effectively small.
It is not obvious that (2) approaches the Laplace distribution as $\nu$ disappears,
but a brief study of (2) shows that this is in fact what happens.
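The behavior of (2) and its approach to (3) can be examined numerically. The sketch below is illustrative: the parameter values are arbitrary, the function names are introduced here, and the closed form used for the mean, $\tau/(1-\nu)$, is not quoted from the text but follows from adding one period $\tau$ to the expected number of empty periods times $\tau$.

```python
from math import exp, sinh

def f_t(t, lam, tau):
    """Density (2): a sinh branch on [0, tau] and an exponential tail beyond."""
    nu = exp(-lam * tau)
    if t < 0:
        return 0.0
    if t <= tau:
        return lam * nu * sinh(lam * t) / (1.0 - nu)
    return (1.0 + nu) / (2.0 * nu) * lam * exp(-lam * t)

def laplace(t, lam, tau):
    """Density (3): Laplace density centered on tau."""
    return 0.5 * lam * exp(-lam * abs(t - tau))

def midpoint(g, a, b, steps=50_000):
    """Midpoint-rule integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

# Illustrative parameters with a sizeable random component.
lam, tau = 1.0, 1.0
nu = exp(-lam * tau)
upper = tau + 40.0 / lam          # the tail beyond this point is negligible

total = (midpoint(lambda t: f_t(t, lam, tau), 0.0, tau)
         + midpoint(lambda t: f_t(t, lam, tau), tau, upper))
mean = (midpoint(lambda t: t * f_t(t, lam, tau), 0.0, tau)
        + midpoint(lambda t: t * f_t(t, lam, tau), tau, upper))

# The two branches of (2) meet at t = tau.
left_at_tau = lam * nu * sinh(lam * tau) / (1.0 - nu)
right_at_tau = (1.0 + nu) / (2.0 * nu) * lam * exp(-lam * tau)

# With lam * tau = 10 the density is nearly indistinguishable from (3).
max_gap = max(abs(f_t(i * 0.001, 10.0, 1.0) - laplace(i * 0.001, 10.0, 1.0))
              for i in range(0, 3001))
print(round(total, 6), round(mean, 6), round(tau / (1.0 - nu), 6), max_gap)
```

The gap between the two densities is of order $\lambda\nu/2$, which is why it vanishes so quickly as $\lambda\tau$ grows.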


Proof of the Distribution* 


We shall now show that (2) is the correct form of the distribution of
interresponse times when responses are triggered by periodic excitations as
shown in Fig. 1.

First of all, (1) must be adjusted to hold s between zero and $\tau$. This
is easily handled. We begin with an excitation and simply cycle the exponential
distribution back to the origin as soon as s reaches $\tau$, letting the distribution
continue to run down until it reaches $\tau$ again, and repeating the process
ad infinitum. The ordinate corresponding to any point s between zero and
$\tau$ will then be given by

$$\begin{aligned}
f(s) &= \lambda e^{-\lambda s} + \lambda e^{-\lambda(s+\tau)} + \lambda e^{-\lambda(s+2\tau)} + \cdots \\
&= \lambda e^{-\lambda s}(1 + \nu + \nu^2 + \cdots).
\end{aligned}$$

Consequently the position of the response in the interval between excitations
will be distributed as

$$f(s) = \frac{\lambda e^{-\lambda s}}{1-\nu}, \qquad 0 \le s \le \tau. \tag{1a}$$


The distribution of r, the interval from the response to the next following
excitation, is now determined, since, from Fig. 1, s = $\tau$ - r. Substituting
in (1a) yields

$$f(r) = \frac{\lambda\nu\,e^{\lambda r}}{1-\nu}, \qquad 0 \le r \le \tau. \tag{4}$$

Evidently r and s are perfectly (and inversely) correlated in the same
excitation period. On the other hand only one response can occur between
two excitations. Hence, when intervals between responses are analyzed, r
will belong to one excitation period and s will belong to a later one, thus
making r and s independent for determining t.

*The writer is indebted to a referee for suggesting several excellent ways to simplify
the original proof.

It should be clear that t is not just the sum of r and s although a cursory
examination of Fig. 1 leaves that impression. The trouble with the impression
is that several excitation periods may separate $R_1$ from $R_2$. In other words,
responses are not forced by excitations. A new excitation may come along
before the response is emitted. We have drawn Fig. 1 as though response $R_2$
fell into the excitation period following $R_1$, but a moment's reflection suggests
that things might not happen so neatly. To deal with this nasty eventuality,
we shall define t as

$$t = k\tau + r + s, \tag{5}$$


where r is taken as the time interval betw[...]
excitation, [...] of periods in which no response [...]
[...]ion periods between $R_1$ and $R_2$.

An interval beginning with an excitation and terminating in a response
may span several excitations before th[...]
subsequent response can be resol[...]
(i) the number of excitation period[...]
R in the period between the last two excitations. In view of the independence
of k and s, we can write

$$\lambda e^{-\lambda(k\tau+s)} = P(k\tau)\,\frac{\lambda e^{-\lambda s}}{1-\nu},$$

where $P(k\tau)$ is the probability of a particular value of $k\tau$. We find that

$$P(k\tau) = \nu^{k}(1-\nu). \tag{6}$$

In other words, the distribution of $k\tau$ is geometric with ordinates spaced out
at successive multiples of $\tau$.


All three components of (5) are independent. Moreover, the variables
r and s form a unit that is the same for each value of k. Consequently, (5)
can be amended to read

$$t = k\tau + y, \tag{5a}$$

where y = r + s, $0 \le y \le 2\tau$, and $k\tau$ has the geometric distribution given by (6).
The distribution of y is obtained from the convolution of r and s. After
some simplification we find that

$$f(y) = \begin{cases}
c\,\sinh \lambda y, & 0 \le y \le \tau, \\
c\,\sinh \lambda(2\tau - y), & \tau \le y \le 2\tau,
\end{cases} \tag{7}$$

where

$$c = \frac{\lambda\nu}{(1-\nu)^2}.$$


The distribution of t will depend on the number of excitations between
the pair of responses that bound each interval. This number fixes k, and it
follows that each change in k will define a new component of the distribution
of t. It will be convenient to describe each component separately by linking
it to the number of excitations in the interval. The density function of the
kth harmonic component of f(t) will be indicated as $f_k(y)$, since k and y
determine t. Equations (5a) and (6) yield

$$f_k(y) = (1-\nu)\,\nu^{k} f(y). \tag{8}$$

For example, $f_0(y)$ is the density function of interresponse times with a
single excitation between each pair of responses. This component of f(t)
has k equal to zero and is defined over the interval $0 \le t \le 2\tau$. The average
interresponse time in the interval is $\tau$.

The first harmonic component $f_1(y)$ refers to interresponse times with
just two excitations between each pair of responses. Hence k is unity and
$f_1(y)$ spans values of t between $\tau$ and $3\tau$. The average interresponse time
is $2\tau$. Higher harmonic components are defined in the same way.

The foregoing makes it evident that for values of t > $\tau$ the density
function has contributions from two harmonic components in each interval
corresponding to the length of an excitation period. The pair of contributors
will change as we proceed away from the origin in multiples of $\tau$, but every
element of density in f(t) after t = $\tau$ will turn out to have two components.
Specifically,

$$f(t) = f_k(y) + f_{k+1}(y - \tau), \qquad \tau \le y \le 2\tau. \tag{9}$$

If the densities on the right-hand side of (9) are replaced by equivalent
expressions determined from (8), it is easily shown that

$$f(t) = \frac{1+\nu}{2\nu}\,\lambda e^{-\lambda(k\tau+y)}. \tag{9a}$$

Now recall that $k\tau + y$ is simply another way of writing t, and it is apparent
that the harmonic components of f(t) interlace themselves in a way that
produces a surprisingly simple expression for the distribution of t:

$$f(t) = \begin{cases}
\dfrac{\lambda\nu}{1-\nu}\,\sinh \lambda t, & 0 \le t \le \tau, \\[6pt]
\dfrac{1+\nu}{2\nu}\,\lambda e^{-\lambda t}, & t \ge \tau.
\end{cases}$$

This is (2) and the proof is complete.
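The step from (9) to (9a) is stated without detail; one way to fill it in is sketched below. It assumes the constant in (7) is $c = \lambda\nu/(1-\nu)^2$, the value obtained by convolving (1a) with (4), and it uses only $\nu = e^{-\lambda\tau}$.

```latex
\begin{aligned}
f_k(y) + f_{k+1}(y-\tau)
  &= (1-\nu)\,\nu^{k} c\,\bigl[\sinh\lambda(2\tau-y) + \nu\,\sinh\lambda(y-\tau)\bigr], \\
\sinh\lambda(2\tau-y)
  &= \tfrac12\bigl(\nu^{-2}e^{-\lambda y} - \nu^{2}e^{\lambda y}\bigr), \qquad
\nu\,\sinh\lambda(y-\tau)
   = \tfrac12\bigl(\nu^{2}e^{\lambda y} - e^{-\lambda y}\bigr), \\
\sinh\lambda(2\tau-y) + \nu\,\sinh\lambda(y-\tau)
  &= \frac{1-\nu^{2}}{2\nu^{2}}\,e^{-\lambda y}, \\
f_k(y) + f_{k+1}(y-\tau)
  &= \frac{\lambda\,\nu^{k+1}}{1-\nu}\cdot\frac{(1-\nu)(1+\nu)}{2\nu^{2}}\,e^{-\lambda y}
   = \frac{1+\nu}{2\nu}\,\lambda\,\nu^{k}e^{-\lambda y}
   = \frac{1+\nu}{2\nu}\,\lambda\,e^{-\lambda(k\tau+y)}.
\end{aligned}
```

The growing exponentials cancel between the two harmonics, which is why only a pure exponential survives for t beyond $\tau$.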


TABLE 1

[Table body not recoverable from the scan. Legible row labels: interval
between responses; interval from excitation to first subsequent response.]
Once the answer is known, a simpler proof can be established via a
moment generating function. Table 1 gives the moment generating functions
for r, s, and $k\tau$, all of which are easy to work out. The theorem governing
moment generating functions for sums of random variables (see Hoel [8],
or Mood [10]; our notation follows Hoel) allows us to write

$$M_t(\theta) = M_{k\tau}(\theta)\,M_r(\theta)\,M_s(\theta), \tag{10}$$

where $M_t(\theta)$ is defined as

$$M_t(\theta) = \int_0^{\infty} e^{\theta t} f(t)\,dt. \tag{11}$$

The generating functions in Table 1 are now substituted for the corresponding
terms on the right-hand side of (10), and we obtain

$$M_t(\theta) = \frac{e^{\theta\tau} - \nu}{1-\nu}\,\bigl[1 - (\theta/\lambda)^2\bigr]^{-1}. \tag{12}$$

This is the moment generating function of the distribution of interresponse
times. It happens that it is also the m.g.f. of (2), a fact that is easily
demonstrated by substituting (2) for f(t) in (11). The properties of the Laplace
transform assure that (2) will be the only continuous distribution having
the required m.g.f. [12, 13].
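The agreement between the moment generating function and the density can be checked by numerical integration. This is a sketch under stated assumptions: Table 1 is not legible in this copy, so the three factor m.g.f.s below were worked out here from the densities (1a), (4), and the geometric distribution (6), and the closed form in `mgf_closed` is the product of those factors rather than a transcription of the printed equation.

```python
from math import exp, sinh

def f_t(t, lam, tau):
    """Density (2) of the interresponse time."""
    nu = exp(-lam * tau)
    if t <= tau:
        return lam * nu * sinh(lam * t) / (1.0 - nu)
    return (1.0 + nu) / (2.0 * nu) * lam * exp(-lam * t)

def midpoint(g, a, b, steps=50_000):
    """Midpoint-rule integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def mgf_numeric(theta, lam, tau):
    """Definition (11), integrated numerically (requires theta < lam)."""
    upper = tau + 40.0 / (lam - theta)
    g = lambda t: exp(theta * t) * f_t(t, lam, tau)
    return midpoint(g, 0.0, tau) + midpoint(g, tau, upper)

def mgf_product(theta, lam, tau):
    """Product of the three factor m.g.f.s, as in (10)."""
    nu, e = exp(-lam * tau), exp(theta * tau)
    m_k = (1.0 - nu) / (1.0 - nu * e)                          # k * tau, geometric
    m_s = lam * (1.0 - nu * e) / ((lam - theta) * (1.0 - nu))  # s on [0, tau]
    m_r = lam * (e - nu) / ((lam + theta) * (1.0 - nu))        # r on [0, tau]
    return m_k * m_r * m_s

def mgf_closed(theta, lam, tau):
    """Simplified product: (e^{theta tau} - nu) / ((1 - nu)(1 - (theta/lam)^2))."""
    nu = exp(-lam * tau)
    return (exp(theta * tau) - nu) / ((1.0 - nu) * (1.0 - (theta / lam) ** 2))

lam, tau, theta = 2.0, 1.0, 0.5   # illustrative values with theta < lam
numeric = mgf_numeric(theta, lam, tau)
product_form = mgf_product(theta, lam, tau)
closed = mgf_closed(theta, lam, tau)
print(round(numeric, 6), round(product_form, 6), round(closed, 6))
```

The three values should coincide to within the accuracy of the midpoint rule.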


The Laplace Distribution


An interesting limiting case of (2) occurs when the distribution of
interresponse times is dominated by a strong periodic component. The net effect
of this restriction is that (2) is transformed into the Laplace distribution.
Consequently the Laplace distribution characterizes the "noise" in a class
of simple timing devices. The essential feature of these devices is that they
are self-compensating. Intervals that are too long tend to be followed
immediately by intervals that are too short and vice versa. (The correlation
between adjacent interresponse times for the mechanism pictured in Fig. 1
is -.50.) This type of regulation is really what enables us to infer that regular
excitations must be occurring.

The approach of (2) to the Laplace distribution is easily shown via its
moment generating function. Consider (12) when $\nu$ goes to zero. We have
immediately

$$M_{t,\,\nu\to 0}(\theta) = e^{\theta\tau}\bigl[1 - (\theta/\lambda)^2\bigr]^{-1}. \tag{13}$$

This is the m.g.f. of the Laplace distribution. More specifically, (3) has (13)
as its m.g.f. The proof may be established by substituting (3) for f(t) in (11).
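The correlation of -.50 quoted for adjacent intervals can be recovered in the limiting case, where no excitation period is empty and each interval is $\tau$ plus the difference of two independent exponential delays. The indices $t_n$ and $s_n$ are introduced here for illustration; $s_n$ is the delay attached to the nth response.

```latex
\begin{aligned}
t_n &= \tau + s_{n+1} - s_n, \qquad t_{n+1} = \tau + s_{n+2} - s_{n+1}, \\
\operatorname{Var}(t_n) &= \operatorname{Var}(s_{n+1}) + \operatorname{Var}(s_n) = \frac{2}{\lambda^{2}}, \\
\operatorname{Cov}(t_n, t_{n+1}) &= -\operatorname{Var}(s_{n+1}) = -\frac{1}{\lambda^{2}}, \\
\rho(t_n, t_{n+1}) &= \frac{-1/\lambda^{2}}{2/\lambda^{2}} = -\tfrac{1}{2}.
\end{aligned}
```

Only the shared delay $s_{n+1}$ couples the two intervals, and it enters them with opposite signs; this is the self-compensation described above.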
The restriction $\nu$ = 0, which leads from (2) to the Laplace distribution,
implies that k, the number of empty excitation periods, must always be
zero. This follows from the fact that the geometric distribution in (6)
collapses when $\nu$ = 0. Hence t in Fig. 1 will be just precisely the sum of r and s,
and we can ignore the possibility of empty excitation periods. Two responses
are necessary to define t. Hence there must also be two independent
occurrences of s. Call them $s_1$ and $s_2$ corresponding to $R_1$ and $R_2$, respectively.
Refer now to Fig. 1 and observe that

$$t = \tau + s_2 - s_1; \qquad t - \tau = s_2 - s_1.$$

Consequently, t - $\tau$ is distributed as the difference of two exponential
variables and we can write its moment generating function as

$$M_{t-\tau}(\theta) = M_s(\theta)\,M_s(-\theta).$$

The m.g.f. of the exponential distribution is, of course, very familiar and is
given in Table 1 for the variable $k\tau + s$. Substituting this exponential m.g.f.
for $M_s(\theta)$, we obtain

$$\begin{aligned}
M_{t-\tau}(\theta) &= (1 - \theta/\lambda)^{-1}(1 + \theta/\lambda)^{-1} \\
&= \bigl[1 - (\theta/\lambda)^2\bigr]^{-1},
\end{aligned}$$


Which is, as we have alread 
Evidently the Laplace dens 
of the difference between 

ignored in most texts on s 


between excitations gets ver 


in (12) as + approaches zero with A fixed, 
FIGURE 2

General distribution of interresponse times with arbitrary random and periodic parts.
The curve is a plot of equation (2) in the text, with $\lambda$ and $\tau$ = 1. Dashed lines are harmonic
components of the distribution. Horizontal axis: interresponse time t (standard units).


Variances are given in Table 1. The formulas establish that the component
within harmonics (i.e., the variance of y) disappears as $\tau$ vanishes, and the
entire variance becomes concentrated in the differences between harmonics.
This implies that the probability distribution of f(t) must congeal around
its harmonic peaks (see Fig. 2) when $\tau$ goes to zero, and that each peak then
contributes a "line" of density to the resulting exponential distribution.
Intuitively, the limit in (14) means that no delay can be contributed by
the latency between a response and the next excitation. That excitation
is instantly available. Hence r in Fig. 1 vanishes and the entire interval
is consumed by the latency between excitation and response, which we have
assumed to be exponential.


Applications 


Fig. 3 presents a frequency distribution compiled from a long series
of action potentials recorded on a single fiber of the optic nerve of Limulus.
The narrow distribution demonstrates that the data are periodic and the
periodicity seems to originate in the refractory period of the nerve fiber.
The mechanism, however, is not well understood. In this particular case,
the regular sequence of action potentials was achieved by dissecting out a
[...] and shining a beam of [...]
the ommatidium, to which the fiber w[...]
[...]y illumination, resulting [...]
[...] the basic period from about 261 milliseconds to 291
milliseconds. This change i[...]

FIGURE 3

[...]t. The nerve fiber adapted [...] increase in period from 261 to [...]
[...] from the linear drift. Smooth curve is a Laplace distribution.

The preparation and recordi[...] were made by C. G. Mueller. The data were recovered
and analyzed by the writer with the assistance of Michael S. Kennedy.
A normal curve fitted to the same data would have high shoulders and
a flat top. This fact then defines the distribution of interresponse time as 
being leptokurtic. Another illustration is provided in Fig. 4 and is taken 
from data reported by Hill [7]. The distribution was obtained by measuring 
intervals between successive bar-presses made by a white rat. The data 
were taken on the 93rd day of conditioning with a reinforcement schedule 
in which payoff was contingent on delaying at least 21 seconds from the 
last previous response. The normal approximation to Hill's data is shown 
in Fig. 4 by the dashed frequency distribution in the background. This 
normal curve was fitted by matching mean and variance to the data. Re- 
sponses in the 0-3 second class interval were not used for this purpose because 
bursts of responses immediately after reinforcement are believed to be un- 
related to the main effect. In any event the leptokurtic character of Hill’s 
data is evident, and it suggests that the long regimen of training (184 hours) 
on the time discrimination problem made Hill's rat into a fairly accurate 
Laplace-type clock. We are led naturally to conjecture about how the rat 
constructs $\tau$. Does it happen internally via some type of neurological clock or
a stereotyped sequence of movements? 
Skewed distributions of interresponse times with the appearance of
(2) (see Fig. 2) are found often in the literature, usually in connection with
high speed responding. Fig. 5 is taken from Brandauer [2] who studied
response sequences generated by a pigeon pecking at a small illuminated
target. Reinforcement was controlled by a high speed flip-flop and the bird
was reinforced whenever a peck happened to coincide in time with a particular
one of the two states of the flip-flop. Consequently, the probability of
reinforcement was determined by the proportion of time the flip-flop spent
in that state, and the net result was that every response had the same (low)
probability of reinforcement. The pigeon generated an average rate of 5.3
responses per second during the run shown in Fig. 5 which covers
approximately 1000 responses. If the sharp peak in Fig. 5 is in fact created by a
periodic excitatory mechanism, we would conclude that excitations were
coming even faster than 5.3 times a second. This follows because the average
length of the interval
In a recent paper, Hunt and Kuno [9] present several distributions of
interresponse times recorded during spontaneous activity of single fibers
in the spinal cord of the cat. The data run the gamut from the Laplace to
the exponential, including several examples of what appears to be our skewed
distribution (Fig. 2). The effect is exactly what might be expected, if the
same general response system were subjected to varying rates of periodic
excitation.

Discussion 


It would be hard to find levels of behavior further apart than single
fiber activity and overt responding. Yet the distributions of interresponse
times presented in this paper seem applicable to both, and in the limited
view afforded by a study of the time between responses, neither system looks
more complicated or better organized than the other.

When we find stochastic mechanisms like Fig. 1 operating in overt
responding, it probably means no more than that complicated systems of
neurons can be organized to do very simple jobs. Even so, the noise in an
organization may give a clue to the nature of the organization, and thus
provide a way to study it. When we ask, as we did earlier, how the animal
constructs $\tau$, we have to find a way that is compatible with our conception
of the mechanism as dictated by the noise.

The delineation of simple periodic mechanisms affords similar insights
into information coding in single nerve fibers. Knowledge of the general
form of the coding mechanism indicates what kind of noise higher centers
have to face, and suggests possible ways for detecting periodicity in the
noise. For instance, the Laplace distribution presents very interesting
problems to a device attempting to estimate its parameters [1].

The latency mechanism considered in this paper barely scratches the
surface of the possibilities. It turns out that our mechanism has
indistinguishable excitations. Whenever a new excitation appears before there is a response
to an earlier one, it makes no difference whether the new excitation replaces
the old one and reactivates the response trigger, or is simply blocked by
the excitation that is already working. Once this is clear other suggestions
for summating excitations or for parallel channeling present themselves.
For example, there are a number of harmonic distributions of interresponse
times in the literature [3]. These distributions show clusterings of
interresponse times at multiples of a fundamental period, and hence seem to be
closely related to (2) which also has harmonic components. But something
else is required, and it is not entirely clear yet what that something else is.
SENSITIVITY TO CHANGES IN THE INTENSITY
OF WHITE NOISE AND ITS RELATION TO
MASKING AND LOUDNESS¹


GEORGE A. MILLER


Sensitivity to changes in the intensity of a random noise was determined over a
wide range of intensities. The just detectable increment in the intensity of the noise is of
the same order of magnitude as the just detectable increment in the intensity of pure
tones. For intensities more than 30 db above the threshold of hearing for noise the size
in decibels of the increment which can be heard 50 percent of the time is approximately
constant (0.41 db). When the results of the experiment are regarded as measures of
the masking of a noise by the noise itself, it can be shown that functions which describe
intensity discrimination also describe the masking by white noise of pure tones and of
speech.² It is argued, therefore, that the determination of differential sensitivity to
intensity is a special case of the more general masking experiment. The loudness of the
noise was also determined, and just noticeable differences are shown to be unequal in
subjective magnitude. A just noticeable difference at a low intensity produces a much
smaller change in the apparent loudness than does a just noticeable difference at high
intensity.

Differential sensitivity to intensity is one of the oldest and most important
problems in the psychophysics of audition. But previous experiments have concerned
themselves mainly with sensitivity to changes in the intensity of sinusoidal tones, and
if we want to know the differential sensitivity for a complex sound, it is necessary
either to extrapolate from existing information, or actually to conduct the experiment
for the sound in question. This gap in our knowledge is due to expediency, not
oversight. The realm of complex sounds includes an infinitude of acoustic compounds, and
experimental parameters extend in many directions. Just which of these sounds we
select for investigation is an arbitrary matter. Of the various possibilities, however,
one of the most appropriate is random noise, a sound of persistent importance and one
which marks a sort of ultimate on a scale of complexity.

Although the instantaneous amplitude varies randomly, white noise is perceived
as a steady “hishing” sound, and it is quite possible to determine a listener's sensitivity
to changes in its intensity.² The present paper reports the results of such determinations
for a range of noise intensities.

Apparatus and Procedure 

A white-noise voltage, produced by random ionization in a gas tube, was varied
in intensity by shunting the line with known resistances provided by a General Radio
Decade Resistance Box. A schematic diagram of the equipment is shown in Fig. 1.
The attenuators were used to keep constant the values of source and load impedance,
$R_G$ and $R_L$, surrounding the shunt resistances, $R_1$ and $R_2$, since these values must enter
into the computation of the increment which is produced by the insertion of the variable
resistance, $R_2$. The whole system can be represented by the equivalent circuit, also
shown in Fig. 1. For this circuit, the size of an increment in voltage, $\Delta E_L$, is given by

$$\frac{\Delta E_L}{E_L} = \frac{R_G R_2 R_L}{R_1[R_L(R_1 + R_2 + R_G) + R_G(R_1 + R_2)]}.$$

If the system does not introduce amplitude distortion after the increments are
produced, the increment in sound pressure, expressed in decibels, can be taken as
$20 \log_{10}(1 + \Delta E_L/E_L)$.

This article appeared in J. Acoust. Soc. Amer., 1947, 19, 609-619. Reprinted with
permission.
1 This research was conducted under contract with the U.S. Navy, Office of Naval
Research (Contract N5ori-76, Report PNR-28).
2 J. E. Karlin, Auditory tests for the ability to discriminate the pitch and the loudness of
noises, OSRD Report No. 5294 (Psycho-Acoustic Laboratory, Harvard University, August 1,
1945) (available through the Office of Technical Services, U.S. Department of Commerce,
Washington, D. C.).


Throughout the following discussion the intensity of the noise will be stated in
terms of its sensation level, the number of decibels above the listener's absolute
threshold for the noise. If the sound-pressure level of the noise is taken to be the level
generated by a moving-coil earphone (Permoflux PDR-10) when the voltage across the
earphone (measured by a thermocouple) is the same as the voltage required for a
sinusoidal wave (1000 cycles) to generate the given sound pressure in a volume of 6 cc, then
the absolute threshold for the noise corresponds to a sound pressure of approximately
10 db re 0.0002 dyne/cm². Thus the sensation level can be converted into
sound-pressure level by the simple procedure of adding 10 db to the value given for the
sensation level. The spectrum of the noise was relatively uniform (5 db) between
150 and 7000 c.p.s. The measurement and spectrum of the noise transduced by the
earphone PDR-10 has been discussed in detail by Hawkins.³

Once the sound-pressure level and the relative size of the increment in decibels
are known, the absolute value of the increment can be computed. Those interested in
converting the decibels into dynes/cm² will find the nomogram of Fig. 2 a considerable
convenience. A straight line which passes through a value of $\Delta I$ in decibels on the
left-hand scale, and through a value of the sound pressure on the middle scale, will intersect
the right-hand scale at the appropriate value of $\Delta P$ in dyne/cm². When the stimulus is a
plane progressive sound wave, its acoustic intensity $I$ in watts/cm² is proportional to the
square of the pressure: $I = kp^2$.
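The computations described in this section can be sketched in a few lines of Python. This is only an illustration: the function names and the resistance values are hypothetical, and the voltage expression assumes an equivalent circuit in which the variable resistance is inserted in series with the fixed shunt, between a source resistance and a load resistance.

```python
import math

REF_PRESSURE = 0.0002  # dyne/cm^2, the reference pressure quoted in the text


def relative_voltage_increment(r_g, r_l, r_1, r_2):
    """Delta E_L / E_L when R_2 is inserted in series with the shunt R_1.

    Assumed circuit: source resistance r_g, load r_l, shunt across the line.
    """
    numerator = r_g * r_2 * r_l
    denominator = r_1 * (r_l * (r_1 + r_2 + r_g) + r_g * (r_1 + r_2))
    return numerator / denominator


def increment_in_db(relative_increment):
    """Increment in decibels: 20 log10(1 + Delta E_L / E_L)."""
    return 20.0 * math.log10(1.0 + relative_increment)


def pressure_increment(spl_db, inc_db):
    """Absolute increment in dyne/cm^2, the quantity read off the nomogram."""
    pressure = REF_PRESSURE * 10.0 ** (spl_db / 20.0)
    return pressure * (10.0 ** (inc_db / 20.0) - 1.0)


if __name__ == "__main__":
    # Hypothetical resistances in ohms, chosen for illustration only.
    rel = relative_voltage_increment(r_g=600.0, r_l=600.0, r_1=1000.0, r_2=100.0)
    db = increment_in_db(rel)
    print(f"relative increment = {rel:.4f}, i.e. {db:.3f} db")
    print(f"at 60 db SPL the increment is {pressure_increment(60.0, db):.5f} dyne/cm^2")
```

The last function does numerically what the nomogram does graphically: it takes a sound-pressure level and an increment in decibels and returns the increment in pressure units.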

The peak amplitudes in the wave of a white noise are not constant. It is
reasonable to expect, therefore, that the size of the just noticeable difference might vary as a
function of the distribution of peak amplitudes in the wave. In order to evaluate this
aspect of the stimulus, a second experiment was conducted. The noise voltage was
passed through a square-wave generator (Hewlett Packard, Model 210-A) before the
increments were introduced. The spectrum and subjective quality of the noise are not
altered by the square-wave generator, but the peak amplitudes are “squared off” at a
uniform level. The resulting wave form might be described as a square-wave modulated
randomly in frequency.


3 J. E. Hawkins, “The masking of pure tones and of speech by white noise,” in a report
entitled The masking of signals by noise, OSRD Report No. 5387 (Psycho-Acoustic Laboratory,
Harvard University, October 1, 1945) (available through the Office of Technical Services,
U.S. Department of Commerce, Washington, D.C.).

magnitude for noise as it is for tones, at least at the higher levels of intensity.⁹ At the
lower intensities the discrimination for a noise stimulus may be somewhat more acute
than for tones.


Implications for a Quantal Theory of Discrimination 


The notion that the difference limen depends upon the activation of discrete 
neural units is not new. It is suggested by the discreteness of the sensory cells them- 
selves. Only recently, however, has evidence been obtained to support the assumption 
that the basic neural processes mediating a discrimination are of an all-or-none 
character. 

The principal evidence derives from the shape of the psychometric function.
Stevens, Morgan, and Volkmann¹⁰ present the argument in the following way:


We assume that the neural structures initially involved in the perception of a
sensory continuum are divided into functionally distinct units. ... The stimulus which
excites a certain number of quanta will ordinarily do so with a little to spare; it will
excite these quanta and leave a small surplus insufficient to excite some additional
quantum. This surplus stimulation will contribute, along with the increment, $\Delta I$, to
bring into activity the added quantum needed for discrimination. ... How much of
this left-over stimulation or surplus excitation are we to expect? If [the over-all
fluctuation in sensitivity] is large compared to the size of an individual quantum, it is
evident that over the course of time all values of the surplus stimulation occur equally
often. ... From these considerations it follows that, if the increment is added
instantaneously to the stimulus, it will be perceived a certain fraction of the time, and this
fraction is directly proportional to the size of the increment itself.


When the increments are added to a continuous stimulus, however, the listener
finds it difficult to distinguish one-quantum changes in the stimulus from the changes
which are constantly occurring because of fluctuations in his sensitivity. In order to
make reliable judgments, the listener is forced to ignore all one-quantum changes.
Consequently, a stimulus increment under these conditions must activate at least two
additional neural units in order that a difference will be perceived and reported. Thus,
in effect, a constant error of one quantum is added to the psychometric function.
The psychometric function predicted by this line of reasoning can be described
in the following way. When the stimulus increments to a steady sound are less than
some value $\Delta I_q$, they are never reported, and over the range of increments from 0 to
$\Delta I_q$ the psychometric function remains at 0 percent. Between $\Delta I_q$ and $2\Delta I_q$ the
proportion of increments reported varies directly with the size of the increment, and
[...]. This function is illustrated by the solid line of Fig. 4.
The increment which is reported 50 percent of the time is
1.5 times the quantal increment. If we take this value as defining a unit
increment in the stimulus, all the psychometric functions obtained for the two listeners
can be combined into a single function. In other words, we can adjust the individual
intensity scales against which the functions are plotted in order to make all the functions
coincide at the 50 percent point. In Fig. 4 the size of the relative increment in sound
pressure, $\Delta P/P$, has been adjusted so that the increment which was heard 50 percent of
the time is plotted as 1.5 times the quantal increment.

FIGURE 4
The 32 psychometric functions combined in a single graph. Values of $\Delta P/P$ heard 50 percent
of the time are designated as 1.50, and the datum points on each function are plotted relative
to this value. Each point represents 100 judgments.

9 Of the modern investigations, only Dimmick's disagrees strikingly with the values
here given for the higher intensities. F. L. Dimmick and R. M. Olson, The intensive
difference limen in audition, J. acoust. Soc. Am., 1941, 12, 517-525.

Figure 4 shows that the characteristic quantal function was not obtained in this 
experiment. The data are better described by the phi-function of gamma (the normal 
probability integral) indicated by the dashed line. 
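The two candidate curves of Fig. 4 can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, increments are expressed in units of the quantal increment, and the ogive is given a mean of 1.5 and a standard deviation of 0.5 quantal units, following the paper's statement that the standard deviations are about one-third of the means.

```python
import math


def quantal_psychometric(x):
    """Predicted proportion reported for an increment of x quantal units.

    Zero up to one quantum, rising linearly to 1.0 at two quanta.
    """
    return min(1.0, max(0.0, x - 1.0))


def normal_ogive(x, mean=1.5, sd=0.5):
    """Cumulative normal (the 'phi-function of gamma') evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


if __name__ == "__main__":
    for x in (0.5, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5):
        print(f"{x:4.2f}  quantal={quantal_psychometric(x):.3f}  "
              f"ogive={normal_ogive(x):.3f}")
```

Both curves pass through 50 percent at 1.5 quantal units; they differ mainly in the tails, where the rectilinear function reaches 0 and 100 percent abruptly and the ogive approaches them gradually.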

The classical argument for the application of the cumulative probability function
to the difference limen assumes a number of small, indeterminate variables which are
independent, and which combine according to chance. When these variables are
controlled or eliminated, the step-wise, quantal relation is revealed.¹¹ If this reasoning
is correct, then the deviations of the points in Fig. 4 from the quantal hypothesis should
be attributable to the introduction of random variability into the listening situation.

Is there any obvious source of randomness in the experiment? Certainly there
is, for white noise is a paradigm of randomness. The statistical nature of the noise
means that the calculated value of the increment is merely the most probable value,
and that a certain portion of the time the increment will depart from this probable
value by an amount sufficient to affect the discrimination. And in view of the
fluctuating level of the stimulus, it would be surprising indeed if the rigorous experimental
requirements of the quantal hypothesis were fulfilled. This situation demonstrates the
practical difficulty in obtaining the rectilinear functions predicted by the quantal
hypothesis. Any source of variability tends to obscure the step-wise results and to
produce the S-shaped normal probability integral.

It should be noted, however, that the shape of the psychometric function is only
one of the implications of the quantal argument. According to the hypothesis, the slope
of the psychometric function is determined by the size of the difference limen for all
values of stimulus-intensity. The present data accord with this second prediction.
The standard deviations of the probability integrals which describe the data are
approximately one-third the means (or $0.5\,\Delta I_q$) for all the thresholds measured for both
subjects. This invariance in the slope of the function is necessary but not sufficient
evidence for a neural quantum, and it makes possible the representation of the results
in the form shown in Fig. 4.

11 G. A. Miller and W. R. Garner, Effect of random presentation on the psychometric
function: Implications for a quantal theory of discrimination, Am. J. Psychol., 1944, 57,
451-467.


Symbolic Representation of the Data 


In order to represent the experimental results in symbolic form, the following
symbols will be used:

b               numerical constant = 1.333,
c               numerical constant = 0.066 = $\Delta I_q/I$ when $I \gg I_0$,
DL              difference limen (just noticeable difference) expressed in decibels,
f               frequency in cycles per second,
$I$             sound intensity (energy flow),
$I_{\sim}$      sound intensity per cycle,
$I_0$           sound intensity which is just audible in quiet,
$I_m$           sound intensity which is just masked in noise,
$\Delta I_q$    quantal increment in sound intensity = $0.667\,\Delta I_{50}$,
$\Delta I_{50}$ increment in sound intensity heard 50 percent of the time,
L               loudness in sones,
M               masking in decibels,
$N_q$           number of quantal increments above threshold,
R               signal-to-noise ratio per cycle at any frequency,
Z               effective level of noise at any frequency.


An adequate description of the data in Table I can be developed from the
empirical equation

$$\Delta I_q = cI + bI_0, \tag{1}$$

where the quantal increment in the stimulus energy is assumed to have a fixed and a
variable component. Since $\Delta I_{50}$ (the increment which can be heard 50 percent of the
time) equals $1.5\,\Delta I_q$, we can write

$$DL = 10 \log_{10}(1 + \Delta I_{50}/I) = 10 \log_{10}[1 + 1.5(c + bI_0/I)]. \tag{2}$$
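As a rough numerical check, the difference limen implied by the two equations can be tabulated against sensation level. The sketch assumes Eq. (1) in the form of a quantal increment equal to c times the intensity plus b times the threshold intensity, with the constants from the symbol list; the function names are hypothetical.

```python
import math

B = 1.333  # constant b from the symbol list
C = 0.066  # constant c from the symbol list


def quantal_increment_ratio(sensation_level_db):
    """Quantal increment relative to the intensity: c + b * I_0 / I."""
    intensity_re_threshold = 10.0 ** (sensation_level_db / 10.0)  # I / I_0
    return C + B / intensity_re_threshold


def difference_limen_db(sensation_level_db):
    """DL in decibels for the increment heard 50 percent of the time."""
    ratio_50 = 1.5 * quantal_increment_ratio(sensation_level_db)
    return 10.0 * math.log10(1.0 + ratio_50)


if __name__ == "__main__":
    for level in (3, 10, 20, 30, 50, 100):
        print(f"{level:3d} db sensation level -> DL = {difference_limen_db(level):.2f} db")
```

At high sensation levels the computed value settles near 0.41 db, the constant increment reported in the abstract, and it rises steeply as the noise approaches threshold.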
From (2) it is possible to compute the just noticeable [...]
of sensation level, although we know only [...]
absolute values. When the computations are carried through, the values indicated by
the solid curve in Fig. 3 are obtained. The fit of this curve to the data is good enough
to justify the use of Eq. (2) to obtain [...]

It is interesting to note that at high intensities Eq. (1) is equivalent to the
well-known “Weber's Law,” [...] a just noticeable difference is
proportional to the i[...]. Differential sensitivity characteristically
departs from Weber's Law at low intensities, and Fechner long ago
suggested a modification of the law to the form expressed in Eq. (1).¹² The essential feature
of this equation is the rectilinear relation between $\Delta I$ and $I$; the obvious difficulty is
the explanation of the intercept value $bI_0$ which appears in Eq. (1) as an additive factor.
Fechner supposed that this added term is attributable to intrinsic, interfering
stimulation which cannot be eliminated in the measurement of the difference limen. Body
noises, the spontaneous activity of the auditory nervous system, or the thermal noise
of the air molecules have been suggested as possible sources of this background
stimulation, but proof of these possibilities is still lacking. For the present, therefore, we must
regard Eq. (1) as a purely empirical equation.

12 H. Helmholtz, Treatise on physiological optics (translated by P. C. Southall from 3rd
German ed., Vol. II, The sensations of vision, 1911), Optical Society of America, 1924, pp. 172-
181.
Relation to Masking 


There is an operational similarity between experiments designed to study 
differential sensitivity for intensity and experiments devised to measure auditory 
masking. This similarity is usually obscured by a practical inclination to ignore the 
special case where one sound is masked by another sound identical with the first. 

Suppose we want to know how much a white noise masks a white noise. What 
experimental procedures would we adopt? Obviously, the judgment we would ask the 
listener to make is the same judgment made in the present experiment. In the one case, 
however, we present the data to show the smallest detectable increment, while in the 
other we use the same data to determine the shift in threshold of the masked sound. 
When the masked and masking sounds are identical, the difference between masking 
and sensitivity to changes in intensity lies only in the way the story is told. 

A striking example of this similarity is to be found in the work of Riesz. In 
order to produce gradual changes in intensity, Riesz used tones differing in frequency 
by 3 cycles and instructed his listeners to report the presence or absence of beats. 
Although his results are generally accepted as definitive measures of sensitivity to
changes in the intensity of pure tones, it is equally correct to interpret them as measures
of the masking of one tone by another tone differing in frequency by 3 cycles.

Let us, therefore, reconsider the data of Table I. In this table we have presented
in decibels both the sensation level of the noise and the size of the increment which can
be heard 50 percent of the time. How can these data be transformed to correspond
with the definition of masking?


First, consider that we are mixing two noises in order to produce the total
magnitude $I + \Delta I$. Since $I$ is analogous to the intensity of the masking sound, $I + \Delta I$
must equal the intensity of the masking sound plus the intensity of the masked sound,
$I + I_m$. Thus $I_m = \Delta I$, and from the definition of masking $M$ we can write

$$M = 10 \log_{10}(I_m/I_0) = 10 \log_{10}(\Delta I/I_0). \tag{3}$$

Because there appears to be some basic significance to the quantal unit, whereas the
criterion of hearing 50 percent of the increments is arbitrary, we will use the quantal
increment $\Delta I_q$ in Eq. (3). $\Delta I_q$ is defined as 0.667 times the value of the increment
which is heard 50 percent of the time.

$$M = 10 \log_{10}(\Delta I_q/I_0). \tag{3a}$$

Equation (3a) tells us, then, that the logarithm of the ratio of the quantal increment
to the absolute threshold is proportional to the masking of a sound by an identical
sound.
TABLE II
Masking of White Noise by White Noise. Quantal Increments in Decibels and the
Values of Masking Obtained for Two Listeners as a Function of the Sensation Level
of the Masking Noise. Computed Values of Masking According to Eq. (4).

Sensation     Quantal increment (db)     Masking obtained (db)     Masking
level (db)       GM          SM             GM         SM          computed (db)

     3          2.37        2.37           1.61       1.61            1.66
     5       [illegible] [illegible]       3.26       1.18            1.88
    10          0.81        0.81           3.14       3.14            3.00
    12          0.67        0.61           4.22       3.80            3.76
    15          0.58        0.45           6.58       5.39            5.33
    20          0.33        0.37           9.00       9.54            8.99
    25          0.31        0.37          13.73      14.45           13.44
    32          0.27        0.27          20.06      19.97           20.25
    35          0.27        0.34          23.06      24.10           23.22
    45          0.29        0.30          33.33      33.53           33.20
    52          0.27        0.31          40.06      40.73           40.20
    55          0.27        0.34          42.97      44.10           43.20
    70          0.27        0.32          57.97      58.81           58.20
    82          0.22        0.32          68.32      70.81           70.20
    85          0.22        0.33          72.22      73.91           73.20
   100          0.19        0.27          86.43      88.06           88.20


It is now possible to determine the values of $\Delta I_q$ and $I_0$ from the information
given in Table I, and to substitute these values into Eq. (3a). The results of converting
the differential thresholds into quantal increments and then into masked thresholds are
given in Table II for the two listeners, and are shown in Fig. 5 where masking is plotted
as a function of the sensation level of the masking noise. In addition, Table II contains
values of masking which are computed when Eqs. (1) and (3a) are combined:

$$M = 10 \log_{10}[(cI/I_0) + b]. \tag{4}$$
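The conversion from quantal increments to masking can be reproduced with a short calculation. In this sketch the function names are hypothetical; the first function applies Eq. (3a) to a quantal increment stated in decibels, and the second evaluates the combined form of Eq. (4) with the constants b = 1.333 and c = 0.066.

```python
import math

B = 1.333
C = 0.066


def masking_from_quantal_db(sensation_level_db, quantal_increment_db):
    """Masking obtained, Eq. (3a), from a quantal increment given in decibels.

    The increment in decibels is 10 log10(1 + dI_q / I), so dI_q / I is
    recovered first and then referred to the threshold intensity I_0.
    """
    ratio_to_masker = 10.0 ** (quantal_increment_db / 10.0) - 1.0  # dI_q / I
    return sensation_level_db + 10.0 * math.log10(ratio_to_masker)


def masking_computed(sensation_level_db):
    """Masking computed from Eq. (4): 10 log10(c * I / I_0 + b)."""
    return 10.0 * math.log10(C * 10.0 ** (sensation_level_db / 10.0) + B)


if __name__ == "__main__":
    # Sensation level and quantal increment taken from the first row of Table II.
    print(f"obtained: {masking_from_quantal_db(3.0, 2.37):.2f} db")
    print(f"computed: {masking_computed(3.0):.2f} db")
```

Small disagreements with other tabled entries are to be expected, because the quantal increments in the table are themselves rounded to two decimals.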
For intensities 25 db or more above threshold, the masking noise is about 12 db more
intense than the masked noise.

[...] ask whether these results correspond to the functions [...]
mask tones or human speech. Fortunately, we are [...]
answer this question. Hawkins³ has measured the masking effects of noise on
tones and speech with experimental conditions and equipment directly comparable
with those used here.

Suppose, for purposes of comparison, we choose to mask a 1000-cycle tone.
We find over [...] this particular white noise just masks a 1000-[...].
Since the corresponding value is 12 db [...] energy is concentrated at 1000 c.p.s.
than when the energy is spread over the entire spectrum. In order to compare the
forms of the two masking functions, therefore, we can subtract 8 db from the level of
the noise which masks the 1000-cycle tone.
FIGURE 5
Discriminable increments in intensity of white noise plotted in a manner analogous to masking
experiments. Solid line represents function obtained by Hawkins for the masking of tones
and speech by white noise.

When we make this correction of 8 db in the noise level for Hawkins' data for a
1000-cycle tone and plot the masking of this tone as a function of the corrected noise
intensity, we obtain the solid line shown in Fig. 5. The correspondence between this
curve, taken from Hawkins' data, and the points obtained in the present experiment
is remarkably close. The function computed from Eq. (4) falls too close to Hawkins'
function to warrant its separate presentation in Fig. 5.

The choice of 1000 c.p.s. is not crucial to this correspondence. As Fletcher
and Munson¹³ have pointed out, a single function is adequate to describe the masking
by noise of pure tones, if the intensity of the noise is corrected by a factor which is a
function of the frequency of the masked tone. This factor is given at any frequency $f$
by the ratio $R$ of the intensity of the masked tone to the intensity per cycle of the noise
at that frequency: $R = I_m/I_{\sim}$. $R$ is experimentally determined for all frequencies at
intensities well above threshold, on the rectilinear portion of the function shown in
Fig. 5.

For noises with continuous spectra, the masking of a tone of frequency $f$ can be
attributed to the noise in the band of frequencies immediately adjacent to $f$.¹⁴
Consequently, it is convenient to relate the masking of a tone of frequency $f$ to the intensity
per cycle of the noise at $f$, and to express this intensity in decibels re the threshold of
hearing at any frequency. This procedure gives $10 \log_{10}(I_{\sim}/I_0)$, which can be regarded
as the sensation level at $f$ of a one-cycle band of noise. The effective level $Z$ of the
noise at that frequency is then defined as

$$Z = 10 \log_{10}(I_{\sim}/I_0) + 10 \log_{10} R. \tag{5}$$

When the masking of pure tones is plotted as a function of $Z$, the relation between $M$
and $Z$ is found to be independent of frequency. A single function expresses the relation
between $M$ and $Z$ for all frequencies.

13 H. Fletcher and W. A. Munson, Relation between loudness and masking, J. acoust.
Soc. Am., 1937, 9, 1-10.
14 H. Fletcher, Auditory patterns, Rev. Mod. Phys., 1940, 12, 47-65.

When we compare the function relating $M$ to $Z$ with the function obtained in
the present experiment, we find that the sensation level of the noise is equivalent to
$Z + 11.8$ db. Therefore,

$$I/I_0 = 15.14\,R(I_{\sim}/I_0).$$

Substituting this expression into Eq. (4) gives

$$M = 10 \log_{10}[R(I_{\sim}/I_0) + b]. \tag{6}$$


This equation, along with the functions relating $R$ and $I_0$ to frequency, enables us to
compute the masking of pure tones by any random noise of known spectrum. When
$10 \log_{10}(I_{\sim}/I_0)$ is greater than about 15 db, $b$ is negligible for all frequencies, and the
masking can be computed more simply as $10 \log_{10} R + 10 \log_{10}(I_{\sim}/I_0)$.
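A minimal sketch of this computation follows. The input values are hypothetical, since the functions relating R and the threshold to frequency are not reproduced here; only the constant b = 1.333 and the form of Eqs. (5) and (6) are taken from the text.

```python
import math

B = 1.333  # constant b from the symbol list


def effective_level(noise_per_cycle_db, r_db):
    """Effective level Z, Eq. (5): per-cycle sensation level plus 10 log10 R."""
    return noise_per_cycle_db + r_db


def tone_masking(noise_per_cycle_db, r_db):
    """Masking of a pure tone, Eq. (6): 10 log10(R * I~ / I_0 + b)."""
    z = effective_level(noise_per_cycle_db, r_db)
    return 10.0 * math.log10(10.0 ** (z / 10.0) + B)


if __name__ == "__main__":
    # Hypothetical inputs: per-cycle sensation level and 10 log10 R, in db.
    for per_cycle_db in (0.0, 5.0, 15.0, 30.0):
        m = tone_masking(per_cycle_db, r_db=10.0)
        print(f"I~ at {per_cycle_db:4.1f} db, R = 10 db -> M = {m:.2f} db")
```

Once the effective level is well above zero the additive constant contributes almost nothing, which is the simplification noted above: the masking is then essentially equal to Z.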

Hawkins' results show that the function of Eq. (4) can also be adapted to
describe the masking of human speech by white noise.³

Thus the correspondence seems complete. When the masking and the masked
sounds are identical, masking and sensitivity to changes in intensity are equivalent.
The results obtained with identical masking and masked noises are directly comparable
to results obtained with different masked sounds. It is reasonable to conclude,
therefore, that the determination of sensitivity to changes in intensity is a special case of the
more general masking experiment.

It is worth noting that this interpretation of masking is also applicable to visual
sensitivity to changes in the intensity of white light. Data obtained by Graham and
Bartlett¹⁵ provide an excellent basis for comparison, because of the similarity of their
procedure to that of the masking experiment, and because they used homogeneous,
rod-free, foveal areas of the retina. When these data are substituted into Eq. (3) and
plotted as measures of visual masking, the result can be described by the same general
function that we have used to express the auditory masking by noise of tones, speech,
and noise.

Relation to Loudness 


When Fechner adopted the just noticeable difference as the unit for sensory
[...] a controversy which is still alive today: Are equally-often-noticed
[...] case of auditory loudness, the answer seems to be
[...] (j.n.d.'s) at high intensities are subjectively much
larger than j.n.d.'s at low intensities.

In order to demonstrate that such is the case for noise as well as for pure tones,
we need two kinds of information. We need to know the functions relating noise intensity
to the number of distinguishable steps above threshold, and to the subjective loudness
of the noise in sones. If these two functions correspond, Fechner was right and j.n.d.'s
can be used as units on a subjective loudness-scale. If they do not agree, Fechner was
wrong, and the picture is more complex than he imagined.

15 C. H. Graham and N. R. Bartlett, The relation of size of stimulus and intensity in the human
eye: III. The influence of area on foveal intensity discrimination, J. exper. Psychol., 1940, 27,
149-159.


TABLE III
Loudness and the Number of Quanta. Sensation-Level of Equally Loud 1000-Cycle
Tone as a Function of Sensation-Level of Noise, with Corresponding Loudness in
Sones. Data for 12 Listeners. The Last Column Gives the Corresponding Number
of Quantal Units Above Threshold.

                  Equally loud 1000 c.p.s.
Sensation-        Sensation     Stand.       Loudness in sones        No. of quanta
level of noise    level         dev.         Mean     Stand. dev.     above threshold

15 db             14.2 db       4.6 db       0.036    0.015-0.081      13
30                38.1          6.9          0.83     0.40-1.6         58
45                57.9          9.1          4.8      [illegible]     111
60                74.2          8.2         17.0      9-26            163
75                86.3          [illegible] 37        24-47           216
90                97.9          3.1         76        62-88           268

The number of differential quanta $N_q$ corresponding to a given sensation level
of noise is readily obtained by “stepping off” the quantal increments against a scale of
decibels. The procedure consists of finding the number of quantal increments per unit
of intensity and then integrating:

$$N_q = \int \frac{1}{\Delta I_q}\, dI. \tag{7}$$

If we substitute for the size of the quantal increment according to Eq. (1),

$$N_q = \int \frac{dI}{cI + bI_0} = \frac{1}{c} \ln(cI + bI_0) + C. \tag{8}$$

When we convert to logarithms to the base 10, insert the values for the constants, and
solve in terms of masking $M$, we find that

$$N_q = 3.49M + K. \tag{9}$$

We assume that the number of quantal increments is zero when $I = I_0$, and at this point
Eq. (4) indicates that $M = 1.46$ db. Therefore, $K = -5.1$. Values of $N_q$ obtained by
Eq. (9) are given in Table III, and plotted in Fig. 7.
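The integration can be checked numerically. The sketch below is illustrative and uses hypothetical function names; it evaluates the closed-form integral of Eq. (8) between the threshold intensity and I (taking the threshold intensity as the unit), and compares it with the linear rule of Eq. (9).

```python
import math

B = 1.333
C = 0.066


def masking_computed(sensation_level_db):
    """Eq. (4): M = 10 log10(c * I / I_0 + b)."""
    return 10.0 * math.log10(C * 10.0 ** (sensation_level_db / 10.0) + B)


def quanta_from_integral(sensation_level_db):
    """Eq. (8) evaluated from I_0 to I, with intensities in units of I_0."""
    intensity = 10.0 ** (sensation_level_db / 10.0)
    return (math.log(C * intensity + B) - math.log(C + B)) / C


def quanta_from_masking(sensation_level_db):
    """Eq. (9) with the constants quoted in the text: 3.49 M - 5.1."""
    return 3.49 * masking_computed(sensation_level_db) - 5.1


if __name__ == "__main__":
    for level in (15, 30, 45, 60, 75, 90):
        print(f"{level:3d} db: integral = {quanta_from_integral(level):6.1f}, "
              f"Eq. (9) = {quanta_from_masking(level):6.1f}")
```

The coefficient 3.49 is simply ln(10)/(10c), and the two routes agree to within a fraction of a quantum; both fall within about one unit of the values listed in the last column of Table III.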
The loudness in sones was determined by requiring listeners to equate the
loudness of the noise with the loudness of a 1000-cycle tone. The two sounds were presented
alternately to the same ear, and the listener adjusted the intensity of the tone. Five
equations were made by each of twelve listeners for the six noise-intensities studied.
The result of this experiment, the level of the 1000-cycle tone which sounds equal in
loudness to the noise, defines the loudness level of the noise. With these data, which
are tabulated in Table III and plotted in Fig. 6, the loudness in sones is determined
from the loudness-scale which has been constructed for the 1000-cycle tone. The
values in sones from Stevens' loudness-scale¹⁶ are included in Table III. Table III
also gives the standard deviations of the distributions of loudness levels obtained for
the 12 listeners.

Loudness can also be computed. Fletcher and Munson developed a procedure
for calculating loudness from the masking which the sound produces. When this
procedure is applied to Hawkins' data for the masking of pure tones by noise, we get
the computed values shown in Fig. 6. The agreement between computed and
experimental results is quite satisfactory.

FIGURE 6
Observed and computed values of the loudness level of white noise. Standard deviations of
the values for 12 listeners are indicated by the lengths of the vertical bars.
[Horizontal axis: sensation-level of noise in decibels.]

[FIGURE 7. Co]mparison of the number of discriminatory quanta with the loudness of white noise. Just [...]

We are now equipped to present the two functions shown in Fig. 7. The solid
curve shows the number of quantal units as a function of sensation level. The dashed
curve shows the loudness in sones. The discrepancy between these two curves affirms
the error of Fechner's assumption. Loudness and the number of just noticeable
differences are not linearly related.

When, as in the present case, two variables are both related to a third, it is
possible to determine their relation to each other. Stevens¹⁷ has used Riesz's data for
pure tones to arrive at the empirical equation $L = kN^{2.2}$, where $L$ is the loudness of
the tone in sones, $k$ is the size in sones of the first [...] distinguishable steps.
When we parallel Stevens' computation with the data for noise
[...] describes the relation rather well over [...]

[...] masking is related to sensation level; in
[...] sensation level. The relation of masking to
[...] functions. In Fig. 8 it can be seen that
[...] well. The loudness of a white noise in[...]
the masking produced by the noise on itself,
[...] increment in intensity. In whatever
[...] It is obvious that faint j.n.d.'s are smaller
[...] are not equal units along a scale of loudness.

FIGURE 8
Relation between loudness and masking for white noise.
[Axes: loudness of noise in sones against masking ($\Delta I_q/I_0$ in decibels).]

17 S. S. Stevens, A scale for the measurement of a psychological magnitude: loudness,
Psychol. Rev., 1936, 43, 405-416.
Correction

In 1963 D. H. Raab, E. Osman, and E. Rich noticed an error in Eq. (3), which
is written as if the masked and masking noises had been generated independently. In
fact, however, the two noises were perfectly correlated (cf. Fig. 1), so their sound
pressures added in phase; their combined power was the square of their summed
pressures, not the sum of their squared pressures. When the amount of masking is
recomputed from Table I using $M = 20 \log_{10}(\Delta P/P_0)$, then for intensities 25 db or
more above threshold, the masking noise is about 25 db more intense than the masked
noise (not 12 db as stated on page 128). [...]
therefore, there was facilitation (negative masking) instead of masking; listeners
[...] audible if presented [...]
Osman, and Rich; [...] S. M. Pfafflin and M. V. Mathews,
Energy-detection model for monaural auditory detection, J.
acoust. Soc. Am., 1962, 34, 1842-1853.

When this correction is made, of course, the relation shown in Fig. 8 no longer
obtains.
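The size of this correction can be illustrated with the one threshold value stated in the abstract, an increment of 0.41 db heard 50 percent of the time at high intensities. The sketch below uses hypothetical function names and is not a recomputation of Table I; it only contrasts the two ways of combining the noises.

```python
import math


def increment_level_power_sum(dl_db):
    """Level of the added noise re the masker if the two powers add
    (independent noises, the assumption behind Eq. (3))."""
    return 10.0 * math.log10(10.0 ** (dl_db / 10.0) - 1.0)


def increment_level_pressure_sum(dl_db):
    """Level of the added noise re the masker if the two pressures add in
    phase (perfectly correlated noises, the corrected assumption)."""
    return 20.0 * math.log10(10.0 ** (dl_db / 20.0) - 1.0)


if __name__ == "__main__":
    dl = 0.41  # db, the high-intensity increment heard 50 percent of the time
    print(f"powers add:    {increment_level_power_sum(dl):6.1f} db re masker")
    print(f"pressures add: {increment_level_pressure_sum(dl):6.1f} db re masker")
```

Under the in-phase assumption the just detectable added noise lies roughly 26 db below the masker, of the same order as the figure given in the correction, whereas the independent-noise assumption puts the same 50 percent increment only about 10 db below it (about 12 db for the smaller quantal increment).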


THE MAGICAL NUMBER SEVEN, PLUS OR MINUS TWO:
SOME LIMITS ON OUR CAPACITY FOR
PROCESSING INFORMATION¹


GEORGE A. MILLER 


Harvard University 


My problem is that I have been perse- 
cuted by an integer. For seven years 
this number has followed me around, has 
intruded in my most private data, and 
has assaulted me from the pages of our 
most public journals. This number as- 
sumes a variety of disguises, being some- 
times a little larger and sometimes a 
little smaller than usual, but never 
changing so much as to be unrecogniz- 
able. The persistence with which this 
number plagues me is far more than 
a random accident. There is, to quote 
a famous senator, a design behind it, 
some pattern governing its appearances.
Either there really is something unusual 
about the number or else I am suffering 
from delusions of persecution. 

I shall begin my case history by tell-
ing you about some experiments that
tested how accurately people can assign 
numbers to the magnitudes of various 
aspects of a stimulus. In the tradi- 
tional language of psychology these 
would be called experiments in absolute 


¹ This paper was first read as an Invited
Address before the Eastern Psychological As-
sociation in Philadelphia on April 15, 1955.
Preparation of the paper was supported by
the Harvard Psycho-Acoustic Laboratory un-
der Contract N5ori-76 between Harvard Uni-
versity and the Office of Naval Research, U. S.
Navy (Project NR142-201, Report PNR-174).
Reproduction for any purpose of the U. S.
Government is permitted.


This article appeared in Psychol. Rev., 1956, 63, 81-97. Reprinted with permission.


judgment. Historical accident, how-
ever, has decreed that they should have
another name. We now call them ex-
periments on the capacity of people to
transmit information. Since these ex-
periments would not have been done
without the appearance of information
theory on the psychological scene, and
since the results are analyzed in terms
of the concepts of information theory,
I shall have to preface my discussion
with a few remarks about this theory.


INFORMATION MEASUREMENT


The “amount of information” is ex-
actly the same concept that we have
talked about for years under the name
of “variance.” The equations are dif-
ferent, but if we hold tight to the idea
that anything that increases the vari-
ance also increases the amount of infor-
mation we cannot go far astray.

The advantages of this new way
of talking about variance are simple
enough. Variance is always stated in
terms of the unit of measurement
(inches, pounds, volts, etc.), whereas the
amount of information is a dimension-
less quantity. Since the information in
a discrete statistical distribution does
not depend upon the unit of measure-
ment, we can extend the concept to
situations where we have no metric and
we would not ordinarily think of using
the variance. And it also enables us to
compare results obtained in quite dif- 
ferent experimental situations where it 
would be meaningless to compare vari- 
ances based on different metrics. So 
there are some good reasons for adopt- 
ing the newer concept. 

The similarity of variance and amount 
of information might be explained this 
way: When we have a large variance, 
we are very ignorant about what is go- 
ing to happen. If we are very ignorant, 
then when we make the observation it 
gives us a lot of information. On the 
other hand, if the variance is very small, 
we know in advance how our observa-
tion must come out, so we get little in- 
formation from making the observation. 

If you will now imagine a communi- 
cation system, you will realize that 
there is a great deal of variability about 
what goes into the system and also a
great deal of variability about what 
comes out. The input and the output 
can therefore be described in terms of 
their variance (or their information). 
If it is a good communication system,
however, there must be some system-
atic relation between what goes in and
what comes out. That is to say, the
output will depend upon the input, or
will be correlated with the input. If we
measure this correlation, then we can
say how much of the output variance is
attributable to the input and how much 
is due to random fluctuations or “noise” 
introduced by the system during trans- 
mission. So we see that the measure 
of transmitted information is simply a 
measure of the input-output correlation. 

There are two simple rules to follow. 
Whenever I refer to “amount of in-
formation,” you will understand “vari-
ance.” And whenever I refer to “amount
of transmitted information,” you will
understand “covariance” or “correla- 
tion.” 

The situation can be described graphi-
cally by two partially overlapping cir-
cles. Then the left circle can be taken
to represent the variance of the input, 
the right circle the variance of the out- 
put, and the overlap the covariance of 
input and output. I shall speak of the 
left circle as the amount of input infor- 
mation, the right circle as the amount 
of output information, and the overlap
as the amount of transmitted informa- 
tion. 
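
The two-circle picture can be made concrete with a small calculation. The sketch below is illustrative only: the table of stimulus-response counts is invented, and the function names are hypothetical. It computes the input information, the output information, and their overlap, using the standard identity that the transmitted information equals the input information plus the output information minus the information in the joint distribution.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def transmitted_information(counts):
    """Return (input, output, transmitted) information in bits.

    counts[i][j] is how often stimulus i drew response j.
    """
    total = sum(sum(row) for row in counts)
    joint = [[c / total for c in row] for row in counts]
    p_in = [sum(row) for row in joint]           # left circle
    p_out = [sum(col) for col in zip(*joint)]    # right circle
    h_in = entropy(p_in)
    h_out = entropy(p_out)
    h_joint = entropy([p for row in joint for p in row])
    return h_in, h_out, h_in + h_out - h_joint   # last value is the overlap


# Invented data: four stimuli, four responses, with occasional
# confusions between neighbouring categories.
confusions = [
    [18, 2, 0, 0],
    [2, 16, 2, 0],
    [0, 2, 16, 2],
    [0, 0, 2, 18],
]
h_in, h_out, t = transmitted_information(confusions)
print(f"input {h_in:.2f} bits, output {h_out:.2f} bits, transmitted {t:.2f} bits")
```

With an error-free table (all counts on the diagonal) the transmitted information equals the input information; every confusion moves the two circles apart.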

In the experiments on absolute judg- 
ment, the observer is considered to be 
a communication channel. Then the 
left circle would represent the amount 
of information in the stimuli, the right 
circle the amount of information in his 
responses, and the overlap the stimulus- 
response correlation as measured by the 
amount of transmitted information. The 
experimental problem is to increase the 
amount of input information and to 
measure the amount of transmitted in- 
formation. If the observer’s absolute 
judgments are quite accurate, then 
nearly all of the input information will 
be transmitted and will be recoverable 
from his responses. If he makes errors, 
then the transmitted information may 
be considerably less than the input. We 
expect that, as we increase the amount 
of input information, the observer will 
begin to make more and more errors; 
we can test the limits of accuracy of his 
absolute judgments. If the human ob- 
server is a reasonable kind of communi- 
cation system, then when we increase 
the amount of input information the 
transmitted information will increase at 
first and will eventually level off at some 
asymptotic value. This asymptotic value
we take to be the channel capacity of
the observer: it represents the greatest
amount of information that he can give
us about the stimulus on the basis of
an absolute judgment. The channel ca-
pacity is the upper limit on the extent
to which the observer can match his re-
sponses to the stimuli we give him.
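
How such a leveling-off can arise is easy to illustrate with a toy observer. The sketch below is not the procedure of any experiment discussed in this paper: it assumes a hypothetical observer whose percept is the stimulus value plus Gaussian noise, and the names `transmitted_bits` and `noise_sd`, together with the value 0.05, are invented for illustration.

```python
from math import erf, log2, sqrt


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def transmitted_bits(n_stimuli, noise_sd=0.05):
    """Transmitted information for a hypothetical noisy observer.

    Assumptions (not from the text): n equally likely stimuli sit at the
    centres of n equal bins on the unit interval; the observer perceives
    the stimulus plus Gaussian noise and names the bin the percept falls
    in, with percepts outside the interval assigned to the end bins.
    """
    n = n_stimuli
    joint = []
    for i in range(n):
        centre = (i + 0.5) / n
        row = []
        for j in range(n):
            lower = float("-inf") if j == 0 else j / n
            upper = float("inf") if j == n - 1 else (j + 1) / n
            p_response = (normal_cdf((upper - centre) / noise_sd)
                          - normal_cdf((lower - centre) / noise_sd))
            row.append(p_response / n)  # each stimulus has probability 1/n
        joint.append(row)
    h_in = log2(n)
    h_out = entropy([sum(col) for col in zip(*joint)])
    h_joint = entropy([p for row in joint for p in row])
    return h_in + h_out - h_joint


for n in (2, 4, 8, 16, 32, 64):
    print(f"{n:3d} stimuli: input {log2(n):.2f} bits, "
          f"transmitted {transmitted_bits(n):.2f} bits")
```

In this toy model the transmitted information tracks the input information while the stimuli are far apart relative to the noise, and then stops growing however many more stimuli are added. The value at which it flattens depends entirely on the assumed noise level.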

Now just a brief word about the bit 
and we can begin to look at some data.
One bit of information is the amount of 
information that we need to make a 
decision between two equally likely al- 
ternatives. If we must decide whether 
a man is less than six feet tall or more 
than six feet tall and if we know that 
the chances are 50-50, then we need 
one bit of information. Notice that 
this unit of information does not refer 
in any way to the unit of length that 
we use—feet, inches, centimeters, etc. 
However you measure the man’s height, 
we still need just one bit of information. 

Two bits of information enable us to 
decide among four equally likely alter- 
natives. Three bits of information en- 
able us to decide among eight equally 
likely alternatives. Four bits of infor- 
mation decide among 16 alternatives, 
five among 32, and so on. That is to 
say, if there are 32 equally likely alter- 
natives, we must make five successive 
binary decisions, worth one bit each, be- 
fore we know which alternative is cor- 
rect. So the general rule is simple: 
every time the number of alternatives 
is increased by a factor of two, one bit 
of information is added. 
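
The rule can be written compactly. In the sketch below, N (the number of equally likely alternatives) and H (the information in bits) are symbols introduced here for illustration; they are not notation used elsewhere in this paper.

```latex
H = \log_2 N, \qquad \log_2(2N) = \log_2 N + 1
% Examples from the text: N = 2 gives H = 1, N = 4 gives H = 2,
% N = 8 gives H = 3, N = 16 gives H = 4, and N = 32 gives H = 5.
```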

There are two ways we might in- 
crease the amount of input information. 
We could increase the rate at which we 
give information to the observer, so that
the amount of information per unit time 
would increase. Or we could ignore the 
time variable completely and increase 
the amount of input information by 
increasing the number of alternative 
stimuli. In the absolute judgment ex- 
periment we are interested in the second 
alternative. We give the observer as 
much time as he wants to make his re- 
sponse; we simply increase the number 
of alternative stimuli among which he 
must discriminate and look to see where 
confusions begin to occur. Confusions 
will appear near the point that we are 
calling his “channel capacity.”


ABSOLUTE JUDGMENTS OF UNI-
DIMENSIONAL STIMULI


Now let us consider what happens 
when we make absolute judgments of 
tones. Pollack (17) asked listeners to 
identify tones by assigning numerals to 
them. The tones were different with re- 
spect to frequency, and covered the 
range from 100 to 8000 cps in equal 
logarithmic steps. A tone was sounded 
and the listener responded by giving a 
numeral. After the listener had made 
his response he was told the correct 
identification of the tone. 

When only two or three tones were 
used the listeners never confused them. 
With four different tones confusions 
were quite rare, but with five or more 
tones confusions were frequent. With 
fourteen different tones the listeners 
made many mistakes. 

These data are plotted in Fig. 1. 
Along the bottom is the amount of in- 
put information in bits per stimulus. 
As the number of alternative tones was 
increased from 2 to 14, the input infor- 
mation increased from 1 to 3.8 bits. On 
the ordinate is plotted the amount of
transmitted information. The amount
of transmitted information behaves in
much the way we would expect a com-
munication channel to behave; the trans-
mitted information increases linearly up
to about 2 bits and then bends off to-
ward an asymptote at about 2.5 bits.
This value, 2.5 bits, therefore, is what
we are calling the channel capacity of
the listener for absolute judgments of
pitch.

FIG. 1. Data from Pollack (17, 18) on the
amount of information that is transmitted by
listeners who make absolute judgments of
auditory pitch. As the amount of input in-
formation is increased by increasing from 2
to 14 the number of different pitches to be
judged, the amount of transmitted informa-
tion approaches as its upper limit a channel
capacity of about 2.5 bits per judgment.

So now we have the number 2.5 
bits. What does it mean? First, note 
that 2.5 bits corresponds to about six 
equally likely alternatives. The result 
means that we cannot pick more than 
six different pitches that the listener will 
never confuse. Or, stated slightly dif- 
ferently, no matter how many alterna- 
tive tones we ask him to judge, the best 
we can expect him to do is to assign
them to about six different classes with- 
out error. Or, again, if we know that 
there were N alternative stimuli, then 
his judgment enables us to narrow down 
the particular stimulus to one out of 
N/6.
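
The conversions in the preceding paragraphs are simple powers and logarithms of two. The sketch below merely redoes that arithmetic; the function name `equivalent_alternatives` is introduced here for illustration.

```python
from math import log2


def equivalent_alternatives(bits):
    """Number of equally likely alternatives that carry the given bits."""
    return 2.0 ** bits


# Input information for the smallest and largest sets of tones in Fig. 1.
input_bits_2_tones = log2(2)
input_bits_14_tones = log2(14)

# Channel capacity read from the asymptote in Fig. 1.
capacity_bits = 2.5
classes = equivalent_alternatives(capacity_bits)

# With N alternative tones, one judgment narrows the stimulus down to
# roughly N divided by the number of error-free classes.
n_tones = 14
remaining_candidates = n_tones / round(classes)

print(f"2 tones: {input_bits_2_tones:.1f} bit; "
      f"14 tones: {input_bits_14_tones:.1f} bits")
print(f"{capacity_bits} bits is equivalent to about {classes:.1f} alternatives")
print(f"one judgment among {n_tones} tones leaves about "
      f"{remaining_candidates:.1f} candidates")
```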

Most people are surprised that the 
number is as small as six. Of course, 
there is evidence that a musically so- 
phisticated person with absolute pitch 
can identify accurately any one of 50
or 60 different pitches. Fortunately, I 
do not have time to discuss these re- 
markable exceptions. I say it is for- 
tunate because I do not know how to 
explain their superior performance. So 
I shall stick to the more pedestrian fact 
that most of us can identify about one 
out of only five or six pitches before we
begin to get confused. 

It is interesting to consider that psy-
chologists have been using seven-point
rating scales for a long time, on the 
intuitive basis that trying to rate into 
finer categories does not really add much 
to the usefulness of the ratings. Pol- 
lack’s results indicate that, at least for 
pitches, this intuition is fairly sound. 


FIG. 2. Data from Garner (7) on the chan-
nel capacity for absolute judgments of audi-
tory loudness.


Next you can ask how reproducible 
this result is. Does it depend on the 
spacing of the tones or the various con- 
ditions of judgment? Pollack varied 
these conditions in a number of ways. 
The range of frequencies can be changed 
by a factor of about 20 without chang- 
ing the amount of information trans- 
mitted more than a small percentage. 
Different groupings of the pitches de- 
creased the transmission, but the loss 
was small. For example, if you can
discriminate five high-pitched tones in
one series and five low-pitched tones in
another series, it is reasonable to ex- 
pect that you could combine all ten into 
a single series and still tell them all 
apart without error. When you try it, 
however, it does not work. The chan- 
nel capacity for pitch seems to be about 
six and that is the best you can do. 

While we are on tones, let us look 
next at Garner’s (7) work on loudness. 
Garner’s data for loudness are sum- 
marized in Fig. 2. Garner went to some 
trouble to get the best possible spacing 
of his tones over the intensity range 
from 15 to 110 db. He used 4, 5, 6, 7,
10, and 20 different stimulus intensities. 
The results shown in Fig. 2 take into 
account the differences among subjects 
and the sequential influence of the im- 
mediately preceding judgment. Again 
we find that there seems to be a limit. 
The channel capacity for absolute judg-
ments of loudness is 2.3 bits, or about
five perfectly discriminable alternatives.
Since these two studies were done in
different laboratories with slightly dif-
ferent techniques and methods of analy-
sis, we are not in a good position to
argue whether five loudnesses is signifi-
cantly different from six pitches. Prob-
ably the difference is in the right direc-
tion, and absolute judgments of pitch
are slightly more accurate than absolute
judgments of loudness. The important
point, however, is that the two answers
are of the same order of magnitude.

The experiment has also been done
for taste intensities. In Fig. 3 are the
results obtained by Beebe-Center, Rog-
ers, and O’Connell (1) for absolute
judgments of the concentration of salt
solutions. The concentrations ranged
from 0.3 to 34.7 gm. NaCl per 100
cc. tap water in equal subjective steps.
They used 3, 5, 9, and 17 different con-
centrations. The channel capacity is
1.9 bits, which is about four distinct
concentrations. Thus taste intensities
seem a little less distinctive than audi-
tory stimuli, but again the order of
magnitude is not far off.

FIG. 3. Data from Beebe-Center, Rogers,
and O’Connell (1) on the channel capacity for
absolute judgments of saltiness.

On the other hand, the channel ca-
pacity for judgments of visual position
seems to be significantly larger. Hake
and Garner (8) asked observers to in-
terpolate visually between two scale
markers. Their results are shown in 
Fig. 4. They did the experiment in 
two ways. In one version they let the 
observer use any number between zero 
and 100 to describe the position, al- 
though they presented stimuli at only 
5, 10, 20, or 50 different positions. The 
results with this unlimited response 
technique are shown by the filled circles 
on the graph. In the other version the 
observers were limited in their re- 
sponses to reporting just those stimu- 
lus values that were possible. That is 
to say, in the second version the num- 
ber of different responses that the ob- 
server could make was exactly the same 
as the number of different stimuli that 
the experimenter might present. The 
results with this limited response tech- 
nique are shown by the open circles on 
the graph. The two functions are so 
similar that it seems fair to conclude 
that the number of responses available 
to the observer had nothing to do with 
the channel capacity of 3.25 bits. 

The Hake-Garner experiment has been 
repeated by Coonan and Klemmer. Al- 
though they have not yet published 
their results, they have given me per-
mission to say that they obtained chan-
nel capacities ranging from 3.2 bits for
very short exposures of the pointer po-
sition to 3.9 bits for longer exposures.
These values are slightly higher than
Hake and Garner’s, so we must con-
clude that there are between 10 and 15
distinct positions along a linear inter-
val. This is the largest channel ca-
pacity that has been measured for any
unidimensional variable.

FIG. 4. Data from Hake and Garner (8)
on the channel capacity for absolute judg-
ments of the position of a pointer in a linear
interval.

At the present time these four experi- 
ments on absolute judgments of simple, 
unidimensional stimuli are all that have 
appeared in the psychological journals. 
However, a great deal of work on other 
stimulus variables has not yet appeared
in the journals. For example, Eriksen
and Hake (6) have found that the
channel capacity for judging the sizes
of squares is 2.2 bits, or about five
categories, under a wide range of ex-
perimental conditions. In a separate
experiment Eriksen (5) found 2.8 bits 
for size, 3.1 bits for hue, and 2.3 bits 
for brightness. Geldard has measured 
the channel capacity for the skin by 
placing vibrators on the chest region. 
A good observer can identify about four 
intensities, about five durations, and 
about seven locations. 

One of the most active groups in this 
area has been the Air Force Operational 
Applications Laboratory. Pollack has 
been kind enough to furnish me with 
the results of their measurements for 
several aspects of visual displays. They
made measurements for area and for 
the curvature, length, and direction of 
lines. In one set of experiments they 
used a very short exposure of the stimu-
lus (1/40 second) and then they re-
peated the measurements with a 5-
second exposure. For area they got
2.6 bits with the short exposure and
2.7 bits with the long exposure. For
the length of a line they got about 2.6
bits with the short exposure and about
3.0 bits with the long exposure. Direc-
tion, or angle of inclination, gave 2.8
bits for the short exposure and 3.3 bits
for the long exposure. Curvature was
apparently harder to judge. When the
length of the arc was constant, the re-
sult at the short exposure duration was
2.2 bits, but when the length of the 
chord was constant, the result was only 
1.6 bits. This last value is the lowest 
that anyone has measured to date. I 
should add, however, that these values 
are apt to be slightly too low because 
the data from all subjects were pooled 
before the transmitted information was 
computed. 

Now let us see where we are. First, 
the channel capacity does seem to be a 
valid notion for describing human ob-
servers. Second, the channel capacities
measured for these unidimensional vari- 
ables range from 1.6 bits for curvature 
to 3.9 bits for positions in an interval. 
Although there is no question that the 
differences among the variables are real 
and meaningful, the more impressive 
fact to me is their considerable simi- 
larity. If I take the best estimates I 
can get of the channel capacities for all 
the stimulus variables I have mentioned, 
the mean is 2.6 bits and the standard 
deviation is only 0.6 bit. In terms of 
distinguishable alternatives, this mean 
corresponds to about 6.5 categories, one 
standard deviation includes from 4 to 
10 categories, and the total range is 
from 3 to 15 categories. Considering 
the wide variety of different variables
that have been studied, I find this to
be a remarkably narrow range. 

There seems to be some limitation 
built into us either by learning or by 
the design of our nervous systems, a 
limit that keeps our channel capacities 
in this general range. On the basis of 
the present evidence it seems safe to 
say that we possess a finite and rather 
small capacity for making such unidi- 
mensional judgments and that this ca- 
pacity does not vary a great deal from 
one simple sensory attribute to another.


ABSOLUTE JUDGMENTS OF MULTI-
DIMENSIONAL STIMULI


You may have noticed that I have 
been careful to say that this magical 
number seven applies to one-dimensional 
judgments. Everyday experience teaches 
us that we can identify accurately any 
one of several hundred faces, any one 
of several thousand words, any one of 
several thousand objects, etc. The story 
certainly would not be complete if we 
stopped at this point. We must have 
some understanding of why the one- 
dimensional variables we judge in the 
laboratory give results so far out of 
line with what we do constantly in our 
behavior outside the laboratory. A pos- 
sible explanation lies in the number of 
independently variable attributes of the 
stimuli that are being judged. Objects, 
faces, words, and the like differ from 
one another in many ways, whereas the 
simple stimuli we have considered thus 
far differ from one another in only one 
respect. 

Fortunately, there are a few data on 
what happens when we make absolute
judgments of stimuli that differ from
one another in several ways. Let us
look first at the results Klemmer and
Frick (13) have reported for the abso-
lute judgment of the position of a dot
in a square. In Fig. 5 we see their re-
sults. Now the channel capacity seems
to have increased to 4.6 bits, which
means that people can identify accu-
rately any one of 24 positions in the
square.

FIG. 5. Data from Klemmer and Frick (13)
on the channel capacity for absolute judg-
ments of the position of a dot in a square.

The position of a dot in a square is 
clearly a two-dimensional proposition. 
Both its horizontal and its vertical po- 
sition must be identified. Thus it seems 
natural to compare the 4.6-bit capacity 
for a square with the 3.25-bit capacity 
for the position of a point in an inter- 
val. The point in the square requires 
two judgments of the interval type. If
we have a capacity of 3.25 bits for esti- 
mating intervals and we do this twice, 
we should get 6.5 bits as our capacity 
for locating points in a square. Adding 
the second independent dimension gives 
us an increase from 3.25 to 4.6, but it 
falls short of the perfect addition that 
would give 6.5 bits. 

Another example is provided by Beebe- 
Center, Rogers, and O'Connell. When 
they asked people to identify both the 
saltiness and the sweetness of solutions 
containing various concentrations of salt 
and sucrose, they found that the chan- 
nel capacity was 2.3 bits. Since the ca- 
pacity for salt alone was 1.9, we might 
expect about 3.8 bits if the two aspects 
of the compound stimuli were judged 
independently. As with spatial loca- 
tions, the second dimension adds a little 
to the capacity but not as much as it 
conceivably might. 

A third example is provided by Pol- 
lack (18), who asked listeners to judge 
both the loudness and the pitch of pure 
tones. Since pitch gives 2.5 bits and 
loudness gives 2.3 bits, we might hope 
to get as much as 4.8 bits for pitch and 
loudness together. Pollack obtained 3.1 
bits, which again indicates that the 
second dimension augments the channel 
capacity but not so much as it might. 

A fourth example can be drawn from 
the work of Halsey and Chapanis (9) 
on confusions among colors of equal 
luminance. Although they did not ana-
lyze their results in informational terms,
they estimate that there are about 11 to 
15 identifiable colors, or, in our terms, 
about 3.6 bits. Since these colors varied 
in both hue and saturation, it is prob- 
ably correct to regard this as a two- 
dimensional judgment. If we compare 
this with Eriksen’s 3.1 bits for hue 
(which is a questionable comparison to 
draw), we again have something less 
than perfect addition when a second 
dimension is added. 
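
The shortfall from perfect addition in the first three examples can be tabulated directly. The sketch below uses only capacities quoted above; doubling the 1.9-bit salt value for the salt-and-sucrose case follows the expectation stated in the text rather than a separate measurement for sweetness, and the variable names are illustrative.

```python
# Each entry: (judgment, single-dimension capacities in bits, observed bits).
cases = [
    ("position of a dot in a square", (3.25, 3.25), 4.6),
    ("saltiness and sweetness", (1.9, 1.9), 2.3),
    ("pitch and loudness", (2.5, 2.3), 3.1),
]

rows = []
for judgment, single_bits, observed in cases:
    expected = sum(single_bits)        # what perfect addition would give
    shortfall = expected - observed    # bits lost relative to that
    rows.append((judgment, expected, observed, shortfall))
    print(f"{judgment:32s} expected {expected:.2f}  "
          f"observed {observed:.2f}  shortfall {shortfall:.2f}")

# The observed 4.6 bits for the square, expressed as equivalent alternatives.
positions_in_square = 2 ** 4.6
print(f"4.6 bits is about {positions_in_square:.0f} positions")
```

In every case the observed capacity exceeds that of a single dimension but falls well short of the sum.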

It is still a long way, however, from 
these two-dimensional examples to the 
multidimensional stimuli provided by 
faces, words, etc. To fill this gap we 
have only one experiment, an auditory 
study done by Pollack and Ficks (19). 
They managed to get six different acous- 
tic variables that they could change: 
frequency, intensity, rate of interrup- 
tion, on-time fraction, total duration, 
and spatial location. Each one of these 
six variables could assume any one of
five different values, so altogether there
were 5⁶, or 15,625 different tones that
they could present. The listeners made
a separate rating for each one of these
six dimensions. Under these conditions
the transmitted information was 7.2 bits,
which corresponds to about 150 differ-
ent categories that could be absolutely
identified without error. Now we are
beginning to get up into the range that
ordinary experience would lead us to
expect. 
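
The arithmetic behind these figures can be checked in a few lines. The sketch below is only an illustration: dividing the transmitted information equally among the six dimensions is an assumption made here for simplicity, not something reported by Pollack and Ficks.

```python
from math import log2

n_dimensions = 6
values_per_dimension = 5

# Number of distinct tones and the input information they carry.
n_tones = values_per_dimension ** n_dimensions
input_bits = log2(n_tones)

# Transmitted information reported in the text, as equivalent categories.
transmitted_bits = 7.2
categories = 2 ** transmitted_bits

# Assumption: the transmitted information is shared equally by the six
# dimensions, which gives a rough per-dimension figure.
bits_per_dimension = transmitted_bits / n_dimensions
categories_per_dimension = 2 ** bits_per_dimension

print(f"{n_tones} tones carry {input_bits:.1f} bits of input information")
print(f"{transmitted_bits} bits corresponds to about {categories:.0f} categories")
print(f"per dimension: {bits_per_dimension:.1f} bits, "
      f"about {categories_per_dimension:.1f} categories")
```

Under that equal-split assumption each dimension contributes far less than the 2 to 3 bits measured for single dimensions judged alone, even though the total is much larger.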

Suppose that we plot these data, 
fragmentary as they are, and make a 
guess about how the channel capacity
changes with the dimensionality of the 
stimuli. The result is given in Fig. 6. 
In a moment of considerable daring I 
sketched the dotted line to indicate 
roughly the trend that the data seemed 
to be taking. 

FIG. 6. The general form of the relation be-
tween channel capacity and the number of in-
dependently variable attributes of the stimuli.

Clearly, the addition of independently
variable attributes to the stimulus in-
creases the channel capacity, but at a
decreasing rate. It is interesting to
note that the channel capacity is in- 
creased even when the several variables 
are not independent. Eriksen (5) re- 
ports that, when size, brightness, and 
hue all vary together in perfect correla- 
tion, the transmitted information is 4.1 
bits as compared with an average of 
about 2.7 bits when these attributes are 
varied one at a time. By confounding
three attributes, Eriksen increased the 
dimensionality of the input without in- 
creasing the amount of input informa- 
tion; the result was an increase in chan- 
nel capacity of about the amount that 
the dotted function in Fig. 6 would lead 
us to expect. 

The point seems to be that, as we 
add more variables to the display, we 
increase the total capacity, but we de- 
crease the accuracy for any particular 
variable. In other words, we can make 
relatively crude judgments of several 
things simultaneously. 

We might argue that in the course of 
evolution those organisms were most 
successful that were responsive to the
widest range of stimulus energies in 
their environment. In order to survive 
in a constantly fluctuating world, it was 
better to have a little information about 
a lot of things than to have a lot of in- 
formation about a small segment of the 
environment. If a compromise was nec-
essary, the one we seem to have made is
clearly the more adaptive. 

Pollack and Ficks’s results are very 
strongly suggestive of an argument that 
linguists and phoneticians have been 
making for some time (11). According 
to the linguistic analysis of the sounds 
of human speech, there are about eight 
or ten dimensions (the linguists call
them distinctive features) that distin-
guish one phoneme from another. These
distinctive features are usually binary, 
or at most ternary, in nature. For ex- 
ample, a binary distinction is made be- 
tween vowels and consonants, a binary 
decision is made between oral and nasal 
consonants, a ternary decision is made 
among front, middle, and back pho- 
nemes, etc. This approach gives us 
quite a different picture of speech per- 
ception than we might otherwise obtain 
from our studies of the speech spectrum 
and of the ear’s ability to discriminate 
relative differences among pure tones. 
I am personally much interested in this 
new approach (15), and I regret that 
there is not time to discuss it here. 

It was probably with this linguistic 
theory in mind that Pollack and Ficks 
conducted a test on a set of tonal 
stimuli that varied in eight dimensions, 
but required only a binary decision on 
each dimension. With these tones they 
measured the transmitted information 
at 6.9 bits, or about 120 recognizable 
kinds of sounds. It is an intriguing 
question, as yet unexplored, whether 
one can go on adding dimensions in- 
definitely in this way. 

In human speech there is clearly a 
limit to the number of dimensions that 
we use. In this instance, however, it is
not known whether the limit is imposed 
by the nature of the perceptual ma- 
chinery that must recognize the sounds 
or by the nature of the speech ma-
chinery that must produce them. Some-
body will have to do the experiment to
find out. There is a limit, however, at
about eight or nine distinctive features 
in every language that has been studied, 
and so when we talk we must resort to 
still another trick for increasing our 
channel capacity. Language uses se- 
quences of phonemes, so we make sev- 
eral judgments successively when we 
listen to words and sentences. That is 
to say, we use both simultaneous and 
successive discriminations in order to 
expand the rather rigid limits imposed 
by the inaccuracy of our absolute judg- 
ments of simple magnitudes. 

These multidimensional judgments are 
strongly reminiscent of the abstraction 
experiment of Külpe (14). As you may
remember, Külpe showed that observers
report more accurately on an attribute 
for which they are set than on attributes 
for which they are not set. For exam- 
ple, Chapman (4) used three different 
attributes and compared the results ob- 
tained when the observers were in- 
structed before the tachistoscopic pres- 
entation with the results obtained when 
they were not told until after the pres- 
entation which one of the three attri- 
butes was to be reported. When the 
instruction was given in advance, the 
judgments were more accurate. When 
the instruction was given afterwards, 
the subjects presumably had to judge all 
three attributes in order to report on 
any one of them and the accuracy was 
correspondingly lower. This is in com- 
plete accord with the results we have 
just been considering, where the ac- 
curacy of judgment on each attribute 
decreased as more dimensions were 
added. The point is probably obvious, 
but I shall make it anyhow, that the 
abstraction experiments did not demon- 
strate that people can judge only one 
attribute at a time. They merely showed 
what seems quite reasonable, that peo- 
ple are less accurate if they must judge 
more than one attribute simultaneously.


SUBITIZING


I cannot leave this general area with- 
out mentioning, however briefly, the ex- 
periments conducted at Mount Holyoke 
College on the discrimination of num- 
ber (12). In experiments by Kaufman, 
Lord, Reese, and Volkmann random 
patterns of dots were flashed on a screen 
for 1/5 of a second. Anywhere from 1
to more than 200 dots could appear in 
the pattern. The subject’s task was to 
report how many dots there were. 

The first point to note is that on pat- 
terns containing up to five or six dots 
the subjects simply did not make errors. 
The performance on these small num- 
bers of dots was so different from the 
performance with more dots that it was 
given a special name. Below seven the 
subjects were said to subitize; above 
seven they were said to estimate. This 
is, as you will recognize, what we once
optimistically called “the span of atten-
tion.”


This discontinuity at seven is, of 
course, suggestive. Is this the same 
basic process that limits our unidimen- 
sional judgments to about seven cate- 
Sories? The generalization is tempting, 
but not sound in my opinion. The data 
on number estimates have not been ana- 
lyzed in informational terms; but on 
the basis of the published data I would 
guess that the subjects transmitted 
something more than four bits of in- 
formation about the number of dots. 
Using the same arguments as before, we 
would conclude that there are about 20 
or 30 distinguishable categories of nu- 
merousness. This is considerably more 
information than we would expect to 
get from a unidimensional display. It 
is, as a matter of fact, very much like a 
two-dimensional display. Although the 
dimensionality of the random dot pat- 
terns is not entirely clear, these results 
are in the same range as Klemmer and 
Frick’s for their two-dimensional dis- 
play of dots in a square. Perhaps the 
two dimensions of numerousness are 
area and density. When the subject 
can subitize, area and density may not 
be the significant variables, but when 
the subject must estimate perhaps they 
are significant. In any event, the com- 
parison is not so simple as it might 
seem at first thought. 
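
The arithmetic behind the step from "something more than four bits" to "about 20 or 30 distinguishable categories" can be sketched as follows. This is only an illustration, assuming equally likely categories; the function names are hypothetical and do not come from the original paper.

```python
import math

def categories_from_bits(bits: float) -> float:
    """Number of equally likely categories that carry `bits` of information."""
    return 2.0 ** bits

def bits_from_categories(n: float) -> float:
    """Information, in bits, carried by one choice among n equally likely categories."""
    return math.log2(n)

# Exactly four bits corresponds to 16 categories, so 20 to 30 categories
# corresponds to somewhat more than four bits.
for n in (16, 20, 30):
    print(f"{n} categories = {bits_from_categories(n):.2f} bits")
```

Under this assumption, 20 to 30 categories lie between roughly 4.3 and 4.9 bits.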

This is one of the ways in which the 
magical number seven has persecuted 
me. Here we have two closely related 
kinds of experiments, both of which 
point to the significance of the number 
seven as a limit on our capacities. And 
yet when we examine the matter more 
closely, there seems to be a reasonable 
suspicion that it is nothing more than 
a coincidence. 


THE SPAN OF IMMEDIATE MEMORY 


Let me summarize the situation in 
this way. There is a clear and definite 
limit to the accuracy with which we can 
identify absolutely the magnitude of 
a unidimensional stimulus variable. I 
would propose to call this limit the 
span of absolute judgment, and I 
maintain that for unidimensional judg- 
ments this span is usually somewhere 
in the neighborhood of seven. We are 
not completely at the mercy of this 
limited span, however, because we have 
a variety of techniques for getting 
around it and increasing the accuracy 
of our judgments. The three most im- 
portant of these devices are (a) to 
make relative rather than absolute judg- 
ments; or, if that is not possible, (b) 
to increase the number of dimensions 
along which the stimuli can differ; or 
(c) to arrange the task in such a way 
that we make a sequence of several ab- 
solute judgments in a row. 

The study of relative judgments is 
one of the oldest topics in experimental 
psychology, and I will not pause to re- 
view it now. The second device, in- 
creasing the dimensionality, we have just 
considered. It seems that by adding 
more dimensions and requiring crude, 
binary, yes-no judgments on each at- 
tribute we can extend the span of abso- 
lute judgment from seven to at least 
150. Judging from our everyday be- 
havior, the limit is probably in the 
thousands, if indeed there is a limit. In 
my opinion, we cannot go on compound- 
ing dimensions indefinitely. I suspect 
that there is also a span of perceptual 
dimensionality and that this span is 
somewhere in the neighborhood of ten, 
but I must add at once that there is no 
objective evidence to support this sus- 
picion. This is a question sadly need- 
ing experimental exploration. 

Concerning the third device, the use 
of successive judgments, I have quite a 
bit to say because this device introduces 
memory as the handmaiden of discrimi- 
nation. And, since mnemonic processes 
are at least as complex as are perceptual 
processes, we can anticipate that their 
interactions will not be easily disen- 
tangled. 

Suppose that we start by simply ex- 
tending slightly the experimental pro- 
cedure that we have been using. Up 
to this point we have presented a single 
stimulus and asked the observer to name 
it immediately thereafter. We can ex- 
tend this procedure by requiring the ob- 
server to withhold his response until we 
have given him several stimuli in suc- 
cession. At the end of the sequence of 
stimuli he then makes his response. We 
still have the same sort of input-out- 
put situation that is required for the 
measurement of transmitted informa- 
tion. But now we have passed from 
an experiment on absolute judgment to 
what is traditionally called an experi- 
ment on immediate memory. 

Before we look at any data on this 
topic I feel I must give you a word of 
warning to help you avoid some obvi- 
ous associations that can be confusing. 
Everybody knows that there is a finite 
span of immediate memory and that for 
a lot of different kinds of test materials 
this span is about seven items in length. 
I have just shown you that there is a 
span of absolute judgment that can dis- 
tinguish about seven categories and that 
there is a span of attention that will 
encompass about six objects at a glance. 
What is more natural than to think that 
all three of these spans are different as- 
pects of a single underlying process? 
And that is a fundamental mistake, as 
I shall be at some pains to demonstrate. 
This mistake is one of the malicious 
persecutions that the magical number 
seven has subjected me to. 

My mistake went something like this. 
We have seen that the invariant fea- 
ture in the span of absolute judgment 
is the amount of information that the 
observer can transmit. There is a real 
operational similarity between the ab- 
solute judgment experiment and the 
immediate memory experiment. If im- 
mediate memory is like absolute judg- 
ment, then it should follow that the in- 
variant feature in the span of immediate 
memory is also the amount of informa- 
tion that an observer can retain. If the 
amount of information in the span of 
immediate memory is a constant, then 
the span should be short when the indi- 
vidual items contain a lot of informa- 
tion and the span should be long when 
the items contain little information. For 
example, decimal digits are worth 3.3 
bits apiece. We can recall about seven 
of them, for a total of 23 bits of in- 
formation. Isolated English words are 
worth about 10 bits apiece. If the total 
amount of information is to remain 
constant at 23 bits, then we should be 
able to remember only two or three 
words chosen at random. In this way 
I generated a theory about how the span 
of immediate memory should vary as a 
function of the amount of information 
per item in the test materials. 
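
The prediction described in this paragraph can be written out as a short calculation. The sketch below is illustrative only: it assumes the total retained information is fixed at the value implied by seven decimal digits, and the function name `predicted_span` is hypothetical.

```python
import math

# Seven decimal digits at log2(10) (about 3.3) bits apiece: about 23 bits in all.
TOTAL_BITS = 7 * math.log2(10)

# Bits per item for the kinds of material mentioned in the text.
# The value for words is the rough figure quoted there ("about 10 bits apiece").
BITS_PER_ITEM = {
    "binary digit": 1.0,
    "decimal digit": math.log2(10),
    "isolated English word": 10.0,
}

def predicted_span(bits_per_item: float, total_bits: float = TOTAL_BITS) -> float:
    """Span expected if the total amount of retained information were constant."""
    return total_bits / bits_per_item

for name, bits in BITS_PER_ITEM.items():
    print(f"{name:22s} {bits:5.2f} bits/item -> predicted span {predicted_span(bits):5.2f}")
```

On the constant-information hypothesis the span for words comes out between two and three items, which is the figure the text states.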

The measurements of memory span in 
the literature are suggestive on this 
question, but not definitive. And so it 
was necessary to do the experiment to 
see. Hayes (10) tried it out with five 
different kinds of test materials: binary 
digits, decimal digits, letters of the al- 
phabet, letters plus decimal digits, and 
with 1,000 monosyllabic words. The 
lists were read aloud at the rate of one 
item per second and the subjects had as 
much time as they needed to give their 
responses. A procedure described by 
Woodworth (20) was used to score the 
responses. 

The results are shown by the filled 
circles in Fig. 7. Here the dotted line 
indicates what the span should have 
been if the amount of information in the 
span were constant. The solid curves 
represent the data. Hayes repeated the 
experiment using test vocabularies of 
different sizes but all containing only 
English monosyllables (open circles in 
Fig. 7). This more homogeneous test 
material did not change the picture sig- 
nificantly. With binary items the span 
is about nine and, although it drops to 
about five with monosyllabic English 
words, the difference is far less than 
the hypothesis of constant information 
would require. 
[Figs. 7 and 8 are not reproduced here. Surviving caption text: “... amount of information retained after one ... amount of information per item in the test materials.”] 


There is nothing wrong with Hayes’s 
experiment, because Pollack (16) re- 
peated it much more elaborately and 
got essentially the same result. Pol- 
lack took pains to measure the amount 
of information transmitted and did not 
rely on the traditional procedure for 
scoring the responses. His results are 
plotted in Fig. 8. Here it is clear that 
the amount of information transmitted 
is not a constant, but increases almost 
linearly as the amount of information 
per item in the input is increased. 

And so the outcome is perfectly clear. 
In spite of the coincidence that the 
magical number seven appears in both 
places, the span of absolute judgment 
and the span of immediate memory are 
quite different kinds of limitations that 
are imposed on our ability to process 
information. Absolute judgment is lim- 
ited by the amount of information. Im- 
mediate memory is limited by the num- 
ber of items. In order to capture this dis- 
tinction in somewhat picturesque terms, 
I have fallen into the custom of distin- 
guishing between bits of information 
and chunks of information. Then I can 
say that the number of bits of informa- 
tion is constant for absolute judgment 
and the number of chunks of informa- 
tion is constant for immediate memory. 
The span of immediate memory seems 
to be almost independent of the number 
of bits per chunk, at least over the 
range that has been examined to date. 

The contrast of the terms bit and 
chunk also serves to highlight the fact 
that we are not very definite about what 
constitutes a chunk of information. For 
example, the memory span of five words 
that Hayes obtained when each word 
was drawn at random from a set of 1000 
English monosyllables might just as ap- 
propriately have been called a memory 
span of 15 phonemes, since each word 
had about three phonemes in it. Intui- 
tively, it is clear that the subjects were 
recalling five words, not 15 phonemes, 
but the logical distinction is not im- 
mediately apparent. We are dealing 
here with a process of organizing or 
grouping the input into familiar units 
or chunks, and a great deal of learning 
has gone into the formation of these 
familiar units. 


RECODING 


In order to speak more precisely, 
therefore, we must recognize the impor- 
tance of grouping or organizing the in- 
put sequence into units or chunks. 
Since the memory span is a fixed num- 
ber of chunks, we can increase the num- 
ber of bits of information that it con- 
tains simply by building larger and 
larger chunks, each chunk containing 
more information than before. 

A man just beginning to learn radio- 
telegraphic code hears each dit and dah 
as a separate chunk. Soon he is able 
to organize these sounds into letters and 
then he can deal with the letters as 
chunks. Then the letters organize 
themselves as words, which are still 
larger chunks, and he begins to hear 
whole phrases. I do not mean that each 
step is a discrete process, or that pla- 
teaus must appear in his learning curve, 
for surely the levels of organization are 
achieved at different rates and overlap 
each other during the learning process. 
I am simply pointing to the obvious 
fact that the dits and dahs are organ- 
ized by learning into patterns and that 
as these larger chunks emerge the 
amount of message that the operator 
can remember increases correspondingly. 
In the terms I am proposing to use, the 
operator learns to increase the bits per 
chunk. 

In the jargon of communication the- 
ory, this process would be called recod- 
ing. The input is given in a code that 
contains many chunks with few bits per 
chunk. The operator recodes the input 
into another code that contains fewer 
chunks with more bits per chunk. There 
are many ways to do this recoding, but 
probably the simplest is to group the 
input events, apply a new name to the 
group, and then remember the new name 
rather than the original input events. 

Since I am convinced that this proc- 
ess is a very general and important one 
for psychology, I want to tell you about 
a demonstration experiment that should 
make perfectly explicit what I am talk- 
ing about. This experiment was con- 
ducted by Sidney Smith and was re- 
ported by him before the Eastern Psy- 
chological Association in 1954. 

Begin with the observed fact that peo- 
ple can repeat back eight decimal digits, 
but only nine binary digits. Since there 
is a large discrepancy in the amount of 
information recalled in these two cases, 
we suspect at once that a recoding pro- 
cedure could be used to increase the 
span of immediate memory for binary 
digits. In Table 1 a method for group- 
ing and renaming is illustrated. Along 
the top is a sequence of 18 binary digits, 
far more than any subject was able to 
recall after a single presentation. In 
the next line these same binary digits 
are grouped by pairs. Four possible 
pairs can occur: 00 is renamed 0, 01 is 
renamed 1, 10 is renamed 2, and 11 is 
TABLE 1 

WAYS OF RECODING SEQUENCES OF BINARY DIGITS 

Binary Digits (Bits)   1 0  1 0  0 0  1 0  0 1  1 1  0 0  1 1  1 0 

2:1   Chunks           10   10   00   10   01   11   00   11   10 
      Recoding         2    2    0    2    1    3    0    3    2 
3:1   Chunks           101    000    100    111    001    110 
      Recoding         5      0      4      7      1      6 
4:1   Chunks           1010     0010     0111     0011     10 
      Recoding         10       2        7        3 
5:1   Chunks           10100      01001      11001      110 
      Recoding         20         9          25 



renamed 3. That is to say, we recode 
from a base-two arithmetic to a base- 
four arithmetic. In the recoded se- 
quence there are now just nine digits to 
remember, and this is almost within the 
span of immediate memory. In the next 
line the same sequence of binary digits 
is regrouped into chunks of three. There 
are eight possible sequences of three, so 
we give each sequence a new name be- 
tween 0 and 7. Now we have recoded 
from a sequence of 18 binary digits 
into a sequence of 6 octal digits, and 
this is well within the span of immedi- 
ate memory. In the last two lines the 
binary digits are grouped by fours and 
by fives and are given decimal-digit 
names from 0 to 15 and from 0 to 31. 
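
The grouping and renaming just described can be sketched in a few lines. This is a minimal illustration, not a procedure taken from the original experiment: the function name `recode` is hypothetical, the 18-digit sequence is the one read from the chunk rows of Table 1, and each full chunk is simply named by its value as a base-two number.

```python
def recode(bits: str, k: int):
    """Group a string of binary digits into chunks of k and rename each
    complete chunk by its base-two value.

    Returns (names, leftover), where leftover holds any digits at the end
    that do not fill a complete chunk.
    """
    full = len(bits) - len(bits) % k
    names = [int(bits[i:i + k], 2) for i in range(0, full, k)]
    return names, bits[full:]

sequence = "101000100111001110"  # the 18 binary digits of Table 1

for k in (2, 3, 4, 5):
    names, leftover = recode(sequence, k)
    print(f"{k}:1 recoding -> {names}  leftover: {leftover or 'none'}")
```

With chunks of three, the 18 binary digits become six octal digits; with chunks of four or five, a few digits are left over at the end, as in the last rows of the table.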

It is reasonably obvious that this kind 
of recoding increases the bits per chunk, 
and packages the binary sequence into 
a form that can be retained within the 
span of immediate memory. So Smith 
assembled 20 subjects and measured 
their spans for binary and octal digits. 
The spans were 9 for binaries and 7 for 
octals. Then he gave each recoding 
scheme to five of the subjects. They 
studied the recoding until they said 
they understood it for about 5 or 10 
minutes. Then he tested their span for 
binary digits again while they tried to 
use the recoding schemes they had 
studied. 


The recoding schemes increased their 
span for binary digits in every case. 
But the increase was not as large as we 
had expected on the basis of their span 
for octal digits. Since the discrepancy 
increased as the recoding ratio increased, 
we reasoned that the few minutes the 
subjects had spent learning the recod- 
ing schemes had not been sufficient. 
Apparently the translation from one 
code to the other must be almost auto- 
matic or the subject will lose part of the 
next group while he is trying to remem- 
ber the translation of the last group. 

Since the 4:1 and 5:1 ratios require 
considerable study, Smith decided to 
imitate Ebbinghaus and do the experi- 
ment on himself. With Germanic pa- 
tience he drilled himself on each recod- 
ing successively, and obtained the re- 
sults shown in Fig. 9. Here the data 
follow along rather nicely with the re- 
sults you would predict on the basis of 
his span for octal digits. He could re- 
member 12 octal digits. With the 2:1 
recoding, these 12 chunks were worth 
24 binary digits. With the 3:1 recod- 
ing they were worth 36 binary digits. 
With the 4:1 and 5:1 recodings, they 
were worth about 40 binary digits. 
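
The predicted values quoted here follow from multiplying the octal span by the number of binary digits each remembered chunk is worth. The sketch below is a hedged illustration: the multipliers 2, 3, and 3.3 are the ones given in the caption to Fig. 9, and applying 3.3 (about log2 of 10) to both the 4:1 and the 5:1 recodings is an assumption made here to match the "about 40" in the text.

```python
import math

OCTAL_SPAN = 12  # span for octal digits reported in the text

# Binary digits per remembered chunk under each recoding ratio.
# Assumption: 3.3 is used for both the 4:1 and 5:1 recodings.
MULTIPLIERS = {"2:1": 2.0, "3:1": 3.0, "4:1": 3.3, "5:1": 3.3}

predicted = {ratio: OCTAL_SPAN * m for ratio, m in MULTIPLIERS.items()}

for ratio, span in predicted.items():
    print(f"{ratio} recoding: about {span:.1f} binary digits")

print(f"log2(10) = {math.log2(10):.2f} bits per decimal digit")
```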

It is a little dramatic to watch a per- 
son get 40 binary digits in a row and 
then repeat them back without error. 
However, if you think of this merely as 
[Fig. 9 caption, beginning lost: ... binary digits is plotted as a function of the recoding procedure used. The predicted function is obtained by multiplying the span for octals by 2, 3, and 3.3 for recoding into base 4, base 8, and base 10, respectively.] 


a mnemonic trick for extending the 
memory span, you will miss the more 
important point that is implicit in 
nearly all such mnemonic devices. The 
point is that recoding is an extremely 
powerful weapon for increasing the 
amount of information that we can 
deal with. In one form or another we 
use recoding constantly in our daily 
behavior. 

In my opinion the most customary 
kind of recoding that we do all the time 
is to translate into a verbal code. When 
there is a story or an argument or an 
idea that we want to remember, we usu- 
ally try to rephrase it “in our own 
words.” When we witness some event 
we want to remember, we make a verbal 
description of the event and then re- 
member our verbalization. Upon recall 
we recreate by secondary elaboration 
the details that seem consistent with 
the particular verbal recoding we hap- 
pen to have made. The well-known ex- 
periment by Carmichael, Hogan, and 
Walter (3) on the influence that names 
have on the recall of visual figures is 
one demonstration of the process. 

The inaccuracy of the testimony of 
eyewitnesses is well known in legal psy- 
chology, but the distortions of testi- 
mony are not random—they follow 
naturally from the particular recoding 
that the witness used, and the particu- 
lar recoding he used depends upon his 
whole life history. Our language is tre- 
mendously useful for repackaging ma- 
terial into a few chunks rich in infor- 
mation. I suspect that imagery is a 
form of recoding, too, but images seem 
much harder to get at operationally and 
to study experimentally than the more 
symbolic kinds of recoding. 

It seems probable that even memori- 
zation can be studied in these terms. 
The process of memorizing may be sim- 
ply the formation of chunks, or groups 
of items that go together, until there 
are few enough chunks so that we can 
recall all the items. The work by Bous- 
field and Cohen (2) on the occurrence 
of clustering in the recall of words is 
especially interesting in this respect. 


SUMMARY 


I have come to the end of the data 
that I wanted to present, so I would 
like now to make some summarizing re- 
marks. 

First, the span of absolute judgment 
and the span of immediate memory im- 
pose severe limitations on the amount 
of information that we are able to re- 
ceive, process, and remember. By or- 
ganizing the stimulus input simultane- 
ously into several dimensions and suc- 
cessively into a sequence of chunks, we 
manage to break (or at least stretch) 
this informational bottleneck. 

Second, the process of recoding is a 
very important one in human psychol- 
ogy and deserves much more explicit at- 
tention than it has received. In par- 
ticular, the kind of linguistic recoding 
that people do seems to me to be the 
very lifeblood of the thought processes. 
Recoding procedures are a constant 
concern to clinicians, social psycholo- 
gists, linguists, and anthropologists and 
yet, probably because recoding is less 
accessible to experimental manipulation 
than nonsense syllables or T mazes, the 
traditional experimental psychologist has 
contributed little or nothing to their 
analysis. Nevertheless, experimental 
techniques can be used, methods of re- 
coding can be specified, behavioral in- 
dicants can be found. And I anticipate 
that we will find a very orderly set of 
relations describing what now seems an 
uncharted wilderness of individual dif- 
ferences. 

Third, the concepts and measures 
provided by the theory of information 
provide a quantitative way of getting at 
some of these questions. The theory 
provides us with a yardstick for cali- 
brating our stimulus materials and for 
measuring the performance of our sub- 
jects. In the interests of communica- 
tion I have suppressed the technical de- 
tails of information measurement and 
have tried to express the ideas in more 
familiar terms; I hope this paraphrase 
will not lead you to think they are not 
useful in research. Informational con- 
cepts have already proved valuable in 
the study of discrimination and of lan- 
guage; they promise a great deal in the 
study of learning and memory; and it 
has even been proposed that they can 
be useful in the study of concept for- 
mation. A lot of questions that seemed 
fruitless twenty or thirty years ago may 
now be worth another look. In fact, I 
feel that my story here must stop just 
as it begins to get really interesting. 

And finally, what about the magical 
number seven? What about the seven 
Wonders of the World, the seven seas, 
the seven deadly sins, the seven daugh- 
ters of Atlas in the Pleiades, the seven 
ages of man, the seven levels of hell, 
the seven primary colors, the seven notes 
of the musical scale, and the seven days 
of the week? What about the seven- 
point rating scale, the seven categories 
for absolute judgment, the seven ob- 
jects in the span of attention, and the 
seven digits in the span of immediate 
memory? For the present I propose to 
withhold judgment. Perhaps there is 
something deep and profound behind all 
these sevens, something just calling out 
for us to discover it. But I suspect 
that it is only a pernicious, Pythagorean 
coincidence. 


REFERENCES 


1. BEEBE-CENTER, J. G., ROGERS, M. S., & 
O'CONNELL, D. N. Transmission of in- 
formation about sucrose and saline solu- 
tions through the sense of taste. J. 
Psychol., 1955, 39, 157-160. 

2. BOUSFIELD, W. A., & COHEN, B. H. The 
occurrence of clustering in the recall of 
randomly arranged words of different 
frequencies-of-usage. J. gen. Psychol., 
1955, 52, 83-95. 

3. CARMICHAEL, L., HOGAN, H. P., & WALTER, 
A. A. An experimental study of the 
effect of language on the reproduction 
of visually perceived form. J. exp. 
Psychol., 1932, 15, 73-86. 

4. CHAPMAN, D. W. Relative effects of de- 
terminate and indeterminate Aufgaben. 
Amer. J. Psychol., 1932, 44, 163-174. 

5. ERIKSEN, C. W. Multidimensional stimu- 
lus differences and accuracy of discrimi- 
nation. USAF, WADC Tech. Rep., 
1954, No. 54-165. 

6. ERIKSEN, C. W., & HAKE, H. W. Abso- 
lute judgments as a function of the 
stimulus range and the number of 
stimulus and response categories. J. 
exp. Psychol., 1955, 49, 323-332. 

7. GARNER, W. R. An informational analy- 
sis of absolute judgments of loudness. 
J. exp. Psychol., 1953, 46, 373-380. 

8. HAKE, H. W., & GARNER, W. R. The ef- 
fect of presenting various numbers of 
discrete steps on scale reading accuracy. 
J. exp. Psychol., 1951, 42, 358-366. 

9. HALSEY, R. M., & CHAPANIS, A. Chro- 
maticity-confusion contours in a com- 
plex viewing situation. J. Opt. Soc. 
Amer., 1954, 44, 442-454. 

10. HAYES, J. R. M. Memory span for sev- 
eral vocabularies as a function of vo- 
cabulary size. In Quarterly Progress 
Report, Cambridge, Mass.: Acoustics 
Laboratory, Massachusetts Institute of 
Technology, Jan.-June, 1952. 


11. JAKOBSON, R., FANT, C. G. M., & HALLE, 
M. Preliminaries to speech analysis. 
Cambridge, Mass.: Acoustics Labora- 
tory, Massachusetts Institute of Tech- 
nology, 1952. (Tech. Rep. No. 13.) 

12. KAUFMAN, E. L., LORD, M. W., REESE, 
T. W., & VOLKMANN, J. The discrimi- 
nation of visual number. Amer. J. 
Psychol., 1949, 62, 498-525. 

13. KLEMMER, E. T., & FRICK, F. C. Assimi- 
lation of information from dot and 
matrix patterns. J. exp. Psychol., 1953, 
45, 15-19. 

14. KÜLPE, O. Versuche über Abstraktion. 
Ber. ü. d. I Kongr. f. exper. Psychol., 
1904, 56-68. 

15. MILLER, G. A., & NICELY, P. E. An analy- 
sis of perceptual confusions among some 
English consonants. J. Acoust. Soc. 
Amer., 1955, 27, 338-352. 

16. POLLACK, I. The assimilation of sequen- 
tially encoded information. Amer. J. 
Psychol., 1953, 66, 421-435. 

17. POLLACK, I. The information of elemen- 
tary auditory displays. J. Acoust. Soc. 
Amer., 1952, 24, 745-749. 

18. POLLACK, I. The information of elemen- 
tary auditory displays. II. J. Acoust. 
Soc. Amer., 1953, 25, 765-769. 

19. POLLACK, I., & FICKS, L. Information of 
elementary multi-dimensional auditory 
displays. J. Acoust. Soc. Amer., 1954, 
26, 155-158. 

20. WOODWORTH, R. S. Experimental psy- 
chology. New York: Holt, 1938. 


(Received May 4, 1955) 


REMARKS ON THE METHOD OF PAIRED COMPARISONS: 
I. THE LEAST SQUARES SOLUTION ASSUMING 
EQUAL STANDARD DEVIATIONS 
AND EQUAL CORRELATIONS* 


FREDERICK MOSTELLER 
HARVARD UNIVERSITY 


Thurstone’s Case V of the method of paired comparisons as- 
sumes equal standard deviations of sensations [...] 
stimuli and zero correlations between pairs [...]. 
It is shown that the assumption [...] 
to a least squares estimate of the [...] sensation scale. 


1. Introduction. The fundamental notions underlying Thur- 
stone’s method of paired comparisons (4) are these: 

(1) There is a set of stimuli which can be located on a sub- 
jective continuum (a sensation scale, usually not having a meas- 
urable physical characteristic). 

(2) Each stimulus when presented to an individual gives rise 
to a sensation in the individual. 

(3) The distribution of sensations from a particular stimulus 
for a population of individuals is normal. 

(4) Stimuli are presented in pairs to an individual, thus giv- 
ing rise to a sensation for each stimulus. The individual com- 
pares these sensations and reports which is greater. 

(5) It is possible for these paired sensations to be correlated. 

(6) Our task is to space the stimuli (the sensation means), ex- 
cept for a linear transformation. 


This article appeared in Psychometrika, 1951, 16, 3-9. Reprinted with permission. 
There are numerous variations of the basic materials used in 
the analysis—for example, we may not have n different individuals, 
but only one individual who makes all comparisons several times; or 
several individuals may make all comparisons several times; the in- 
dividuals need not be people. 

Furthermore, there are “cases” to be discussed—for example, 
shall we assume all the intercorrelations equal, or shall we assume 
them zero? Shall we assume the standard deviations of the sensa- 
tion distributions equal or not? 

The case which has been discussed most fully is known as Thur- 
stone’s Case V. Thurstone has assumed in this case that the stand- 
ard deviations of the sensation distributions are equal and that the 
correlations between pairs of stimulus sensations are zero. We shall 
discuss a standard method of ordering the stimuli for this Case V. 
Case V has been employed quite frequently and seems to fit empirical 
data rather well in the sense of reproducing the original proportions 
of the paired comparison table. The assumption of equal standard 
deviations is a reasonable first approximation. We will not stick to 
the assumption of zero correlations, because this does not seem to be 
essential for Case V. 


2. Ordering Stimuli with Error-Free Data. We assume there 
are a number of objects or stimuli, $O_1, O_2, \ldots, O_n$. These stimuli 
give rise to sensations which lie on a single sensation continuum S. 
If $X_i$ and $X_j$ are single sensations evoked in an individual I by the 
ith and jth stimuli, then we assume $X_i$ and $X_j$ to be jointly normally 
distributed for the population of individuals with 

$$
\begin{aligned}
\text{mean of } X_i &= S_i \qquad (i = 1, 2, \ldots, n), \\
\text{variance of } X_i &= \sigma^2(X_i) = \sigma^2 \qquad (i = 1, 2, \ldots, n), \\
\text{correlation of } X_i \text{ and } X_j &= \rho_{ij} = \rho \qquad (i, j = 1, 2, \ldots, n).
\end{aligned}
\tag{1}
$$

The marginal distributions of the $X_i$’s appear as in Figure 1. 

[FIGURE 1. The Marginal Distributions of the Sensations Produced by the Separate Stimuli in Thurstone’s Case V of the Method of Paired Comparisons.] 
The figure indicates the possibility that $X_2 < X_1$, even though $S_1 < S_2$. 
In fact this has to happen part of the time if we are to build any- 
thing more than a rank-order scale. 

An individual I compares $O_i$ and $O_j$ and reports whether 
$X_i \gtrless X_j$ (no ties are allowed). 

We can best see the tenor of the method for ordering the stimuli 
if we first work through the problem in the case of nonfallible data. 
For the case of nonfallible data we assume we know the true propor- 
tion of the time $X_i$ exceeds $X_j$, and that the conditions given above 
(1) are exactly fulfilled. 

Our problem is to find the spacing of the stimuli (or the spacing 
of the mean sensations produced by them, the $S_1, \ldots, S_n$ points in Fig- 
ure 1). Clearly we cannot hope to do this except within a linear 
transformation, for the data reported are merely the percentages of 
times $X_i$ exceeds $X_j$, say $p_{ij}$. 


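$$
p_{ij} = P(X_i > X_j) = \frac{1}{\sqrt{2\pi}\,\sigma(d_{ij})} \int_0^\infty \exp\left\{-\frac{[d_{ij} - (S_i - S_j)]^2}{2\sigma^2(d_{ij})}\right\} \, dd_{ij}, \tag{2}
$$

where $d_{ij} = X_i - X_j$, and $\sigma^2(d_{ij}) = 2\sigma^2(1 - \rho)$. There will be no 
loss in generality in assigning the scale factor so that 

$$
2\sigma^2(1 - \rho) = 1. \tag{3}
$$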


It is at this point that we depart slightly from Thurstone, who char- 
acterized Case V as having equal variances and zero correlations. 
However, his derivations only assume the correlations are zero ex- 
plicitly (and artificially), but are carried through implicitly with 
equal correlations (not necessarily zero). Actually this is a great 
easing of conditions. We can readily imagine a set of attitudinal 
items on the same continuum correlated .34, .38, .42, i.e., nearly 
equal. But it is difficult to imagine them all correlated zero with one 
another. Past uses of this method have all benefited from the fact 
that items were not really assumed to be uncorrelated. It was only 
stated that the model assumed the items were uncorrelated, but the 
model was unable to take cognizance of the statement. Guttman (2) 
has noticed this independently. 

With the scale factor chosen in equation (3), we can rewrite 
equation (2) 

$$
p_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-(S_i - S_j)}^{\infty} e^{-y^2/2} \, dy. \tag{4}
$$

From (4), given any $p_{ij}$ we can solve for $-(S_i - S_j)$ by use of a 
normal table of areas. Then if we arbitrarily assign as a location 
parameter $S_1 = 0$, we can compute all other $S_i$. Thus given the $p_{ij}$ 
matrix we can find the $S_i$. The problem with fallible data is more 
complicated. 

3. Paired Comparison Scaling with Fallible Data. When we 
have fallible data, we have $p'_{ij}$ which are estimates of the true $p_{ij}$. 
Analogous to equation (4) we have 

$$
p'_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-D'_{ij}}^{\infty} e^{-y^2/2} \, dy, \tag{5}
$$

where the $D'_{ij}$ are estimates of $D_{ij} = S_i - S_j$. We merely look up the 
normal deviate corresponding to $p'_{ij}$ to get the matrix of $D'_{ij}$. We 
notice further that the $D'_{ij}$ need not be consistent in the sense that 
the $D_{ij}$ were; i.e., 

$$
D_{ij} + D_{jk} = S_i - S_j + S_j - S_k = D_{ik}, 
$$

does not hold for the $D'_{ij}$. 

We conceive the problem as follows: from the $D'_{ij}$ to construct 
a set of estimates of the $S_i$’s called $S'_i$, such that 

$$
\Sigma = \sum_{i,j} [D'_{ij} - (S'_i - S'_j)]^2 \text{ is to be a minimum.} \tag{6}
$$

It will help to indicate another form of solution for nonfallible 
data. One can set up the $S_i - S_j$ matrix: 

MATRIX OF $S_i - S_j$ 

$$
\begin{array}{l|ccccc}
 & 1 & 2 & 3 & \cdots & n \\ \hline
1 & S_1 - S_1 & S_1 - S_2 & S_1 - S_3 & \cdots & S_1 - S_n \\
2 & S_2 - S_1 & S_2 - S_2 & S_2 - S_3 & \cdots & S_2 - S_n \\
3 & S_3 - S_1 & S_3 - S_2 & S_3 - S_3 & \cdots & S_3 - S_n \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
n & S_n - S_1 & S_n - S_2 & S_n - S_3 & \cdots & S_n - S_n \\ \hline
\text{Totals} & \sum S_i - nS_1 & \sum S_i - nS_2 & \sum S_i - nS_3 & \cdots & \sum S_i - nS_n \\
\text{Means} & \bar{S} - S_1 & \bar{S} - S_2 & \bar{S} - S_3 & \cdots & \bar{S} - S_n
\end{array}
$$

Now by setting $S_1 = 0$, we get $S_2 = (\bar{S} - S_1) - (\bar{S} - S_2)$, 
$S_3 = (\bar{S} - S_1) - (\bar{S} - S_3)$, and so on. We will use this plan shortly for 
the $S'_i$. 
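
The error-free case can be checked numerically. The sketch below is illustrative only and makes several assumptions not in the original: the scale values in `true_S` are hypothetical, the function names are invented for the example, and Python's `statistics.NormalDist` stands in for the "normal table of areas." It builds the true proportions from equation (4), converts them back to differences, and recovers the scale values from the column means of the difference matrix.

```python
from statistics import NormalDist

phi = NormalDist()  # standard normal, matching the scale factor in equation (3)

def proportions(S):
    """True p_ij = P(X_i > X_j); from equation (4) this equals Phi(S_i - S_j)."""
    n = len(S)
    return [[phi.cdf(S[i] - S[j]) for j in range(n)] for i in range(n)]

def scale_from_proportions(p):
    """Recover the S_i (with S_1 = 0) from an error-free matrix of p_ij."""
    n = len(p)
    # D[i][j] = S_i - S_j is the normal deviate corresponding to p_ij.
    D = [[phi.inv_cdf(p[i][j]) for j in range(n)] for i in range(n)]
    # The mean of column j of D is S-bar minus S_j.
    col_mean = [sum(D[i][j] for i in range(n)) / n for j in range(n)]
    # S_j = (S-bar - S_1) - (S-bar - S_j) once S_1 is set to 0.
    return [col_mean[0] - col_mean[j] for j in range(n)]

true_S = [0.0, 0.4, 1.1, 1.5]  # hypothetical scale values with S_1 = 0
recovered = scale_from_proportions(proportions(true_S))
print([round(s, 6) for s in recovered])
```

Because the data are error-free, the recovered values agree with the starting values up to rounding error.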

If we wish to minimize expression (6) we take the partial de- 
rivative with respect to $S'_i$. Since $D'_{ij} = -D'_{ji}$ and 
$S'_i - S'_j = -(S'_j - S'_i)$ and $D'_{ii} = S'_i - S'_i = 0$, we need only concern our- 
selves with the sum of squares from above the main diagonal in the 
$D'_{ij} - (S'_i - S'_j)$ matrix, i.e., terms for which $i < j$. Differentiat- 
ing with respect to $S'_i$ we get: 


9(>/2) চে ke 
ET EO — Ri Si) — ST (Dig —Ii4RH (7) 
05; i= jis 


(T= 12, 25,0) 


Setting this partial derivative equal to zero we have 


FSF Ss es HS — (LS Sia nS 
i-1 8) 
=D 1— Diy (=1,2,...,n), 
ja i=in 
but D'i, = —D';,,, and D’;i, = 0; this makes the right side of (8) 
> D’;j; Fi > D';; = 5 Dj. 
J=1 j=in1 i=1 
Thus (8) can be written 
2S, —HSUL=ID, MSIL, 2,0), (9) 


i=1 Jt 


The determinant of the coefficients of the left side of (9) vanishes.
This is to be expected because we have only chosen our scale
and have not assigned a location parameter. There are various ways
to assign this location parameter, for example, by setting $\bar{S}' = 0$ or
by setting $S'_1 = 0$. We choose to set $S'_1 = 0$. This means we will
measure distances from $S'_1$. Then we try the solution (10) which is
suggested by the similarity of the left side of (9) to the total column
in the matrix of $S_i - S_j$:

\[ S'_i = \sum_{j=1}^{n} D'_{j1}/n - \sum_{j=1}^{n} D'_{ji}/n. \tag{10} \]

Notice that when $i = 1$, $S'_i = 0$ and that

\[ \sum_{i=1}^{n} S'_i = \sum_{j=1}^{n} D'_{j1}, \]

because

\[ \sum_{i} \sum_{j} D'_{ji} = 0, \]

which happens because every term and its negative appear in this
double sum. Therefore, substituting (10) in the left side of (9) we
have

\[ \sum_{j=1}^{n} D'_{j1} - n\left[\sum_{j=1}^{n} D'_{j1}/n - \sum_{j=1}^{n} D'_{ji}/n\right] = \sum_{j=1}^{n} D'_{ji}, \]

which is an identity, and the equations are solved. Of course, any
linear transformation of the solutions is equally satisfactory.
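The computation in equations (5) and (10) can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's procedure: the function names and the three-stimulus example are hypothetical, and it assumes every observed proportion lies strictly between 0 and 1 so that the normal deviate is finite.

```python
from statistics import NormalDist


def scale_values(p):
    """Least squares scale values S'_i from observed proportions p[i][j].

    Assumes p[i][j] + p[j][i] == 1, p[i][i] == 0.5, and 0 < p[i][j] < 1.
    """
    n = len(p)
    inv = NormalDist().inv_cdf
    # Equation (5): D'_ij is the normal deviate corresponding to p'_ij.
    d = [[inv(p[i][j]) for j in range(n)] for i in range(n)]
    # Column totals: the sum over j of D'_ji, one total per stimulus i.
    col = [sum(d[j][i] for j in range(n)) for i in range(n)]
    # Equation (10), with the location parameter fixed by S'_1 = 0.
    return [(col[0] - col[i]) / n for i in range(n)]


def example():
    """Recover hypothetical scale values from error-free proportions."""
    true_s = [0.0, 0.5, 1.2]  # illustrative values, not data from the paper
    cdf = NormalDist().cdf
    p = [[cdf(si - sj) for sj in true_s] for si in true_s]
    return scale_values(p)
```

With error-free proportions the sketch returns the generating scale values up to rounding; with fallible proportions it returns the least squares estimates in the sense of condition (6).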

The point of this presentation is to provide a background for
the theory of paired comparisons, to indicate that the assumption of
zero correlations is unnecessary, and to show that the customary
solution to paired comparisons is a least squares solution in the
sense of condition (6). That this is a least squares solution seems
not to be mentioned in the literature although it may have been
known to Horst (3), since he worked closely along these lines.

This least squares solution is not entirely satisfactory because
the $p'_{ij}$ tend to zero and unity when extreme stimuli are compared.
This introduces unsatisfactorily large numbers in the $D'_{ij}$ table. This
difficulty is usually met by excluding all numbers beyond, say, 2.0
from the table. After a preliminary arrangement of columns so that
the $S'_i$ will be in approximately proper order, the quantity

\[ \sum_{i} (D'_{ij} - D'_{i,j+1})/k \]

is computed where the summation is over the $k$ values of $i$ for which
entries appear in both column $j$ and $j+1$. Then differences between
such means are taken as the scale separations (see for example
Guilford's discussion (1) of the method of paired comparisons). This
method seems to give reasonable results. The computations for methods
which take account of the differing variabilities of the $p'_{ij}$ and
therefore of the $D'_{ij}$ seem to be unmercifully extensive.

It should also be remarked that this solution is not entirely a
reasonable one because we really want to check our results against
the original $p'_{ij}$. In other words, a more reasonable solution might
be one such that once the $S'_i$ are computed we can estimate the $p'_{ij}$
by $p''_{ij}$, and minimize, say,

\[ \sum (p'_{ij} - p''_{ij})^2 \]

or perhaps

\[ \sum \left(\arcsin \sqrt{p'_{ij}} - \arcsin \sqrt{p''_{ij}}\right)^2. \]

Such a thing can no doubt be done, but the results of the author's
attempts do not seem to differ enough from the results of the present
method to be worth pursuing.


THEORETICAL RELATIONSHIPS AMONG SOME
MEASURES OF CONDITIONING 


By CONRAD G. MUELLER 
COLUMBIA UNIVERSITY 
Communicated by C. H. Graham, December 10, 1949 


The relationships among the various measures of strength of conditioning
constitute an important problem for conditioning theory. Many different
measures have been used.¹ The measures latency and magnitude are based
on the occurrence of a single response, while number of responses in extinction
and the rate of responding in a “free-response” situation are based on more
than one instance of a response. Probability of response occurrence is another
term that is encountered in the literature; it is used most frequently
in cases where more than one response is possible (e.g., right and left turns
in a T maze) and in circumstances when it is possible to compute the frequency
or the percentage of times that a specified response is given. Percentage
of response occurrence is taken to be an estimate of the probability
of obtaining the response.

Some theoretical formulations are concerned with one or two measures of
strength; others are more inclusive. In only few cases has an attempt been
made to present a theory of the relation among measures. In most treatments
that consider several measures, the relations among the measures
are empirically determined.

The purpose of the present note is to indicate one possible theoretical
account of the relationships among latency of response, rate of responding
and the probability of occurrence of a response. The last measure serves
as the starting point for the discussion and provides the terms in which the
other concepts are related.
Consider the Skinner bar-pressing situation² in which a rat's responses
may occur at any time and at any rate during the period in which the
animal is in the experimental cage. Assume that the responses under constant
testing conditions are randomly distributed in time. Let the rate of
occurrence of these responses be represented by $r$. It may then be shown
that the probability, $P_{>t}$, of obtaining an interval between two responses
greater than $t$ is

\[ P_{>t} = e^{-rt}, \tag{1} \]

where $e$ is the base of Naperian logarithms. The probability of obtaining
$n$ responses in an interval, $T$, is

\[ P_n = (rT)^n e^{-rT}/n!. \tag{2} \]

Equation (1) gives us a statement of the distribution of time intervals
associated with various rates of responding. For example, for the median
time interval $P_{>t}$ is 0.5 and $-rt$ is $\log_e 0.5$ or the median $t$ is $0.69/r$.
Equation (2) gives the probability of various numbers of responses within some
specified time interval. For example, the probability of getting exactly
one response in an interval, $T$, is $(rT)e^{-rT}$. The relation between equations
(1) and (2) is obvious when we consider the probability of getting no
responses in an interval, $T$. In this case $P_0$ is $e^{-rT}$.

This article appeared in Proc. natl. Acad. Sci., 1950, 36, 123-130. Reprinted with
permission.
Equations (1) and (2) permit us to transform a rate measure into a proba- 
bility measure. Since we are dealing with a continuous distribution 
(time), the probability of a response at any particular time is zero, but the 
probability of a response within given time intervals is finite and specifi- 
able. 
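As a minimal numerical sketch of equations (1) and (2), the following Python functions evaluate both probabilities for a given rate. The function names are hypothetical and chosen only for illustration; the rate and the time values must be expressed in the same units.

```python
import math


def prob_interval_greater(r, t):
    """Equation (1): probability that an inter-response interval exceeds t."""
    return math.exp(-r * t)


def prob_n_responses(r, T, n):
    """Equation (2): probability of exactly n responses in an interval T."""
    return (r * T) ** n * math.exp(-r * T) / math.factorial(n)


def median_interval(r):
    """The interval at which equation (1) equals 0.5, i.e. log_e(2) / r."""
    return math.log(2) / r
```

Setting `n = 0` in `prob_n_responses` reproduces `prob_interval_greater`, which is the relation between the two equations noted in the text.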

Latency usually refers to the time interval between a stimulus and a
response and thus is not directly considered in the previous development.
Assume, however, that the stimulus conditions are one determinant of the
rate of responding, that is, that the rate has different values for different
stimulus conditions. This assumption is consistent with the discussions by
Skinner and others who have emphasized the measurement of rate; the
assumption would presumably be an elementary requirement for any
measure.

Under the circumstances of the assumption, $t$ may be employed in discussing
latency, since the latter would be the time interval between the
beginning of the observation period (when a stimulus was presented) and
the first response. Thus, on the assumption that stimulus conditions are a
determinant of rate of responding and on the previous assumption that the
responses are randomly distributed in time, a statement of the rate of responding
under specified stimulus conditions implies a probability statement
of the delay of length $t$ between the presentation of the stimulus and
the occurrence of the first response. This statement tells us not only of the
distribution of latencies but also of the relationship between some representative
value, say the median latency, and the rate of responding; for
example, the probability of a response greater than the median latency,
$t_{md}$, is 0.5; and from equation (1) we see that $-rt_{md} = \log_e 0.5$ or that
the median latency equals $0.69/r$.

The preceding development does not imply any particular theory of conditioning
but may be incorporated into a large class of theories. For example,
if the foregoing discussion is combined with a theory that states
that rate of responding is proportional to the number of responses that remain
to be given in extinction, the measure of number of responses in
extinction is immediately related to our latency and probability terms. In
other words, if

\[ r = k(N - n), \tag{3} \]


where $N$ is the number of responses in extinction, $n$ is the number of responses
already given, and $k$ is a constant, we may substitute $k(N - n)$ for
$r$ in equation (1) and obtain

\[ P_{>t} = e^{-k(N-n)t}. \tag{4} \]

This equation may then be examined for relationships existing among the
terms $n$, $N$, $P$ and $t$. In addition to the relationships among latency, rate
and number of responses in extinction, equation (4) may be used to predict
the distribution of responses in extinction for a constant strength and the
distribution of time intervals between responses at various stages of extinction.

Since the present argument follows mainly from the assumption of a
random distribution of responses in time, it is of interest to examine data
for direct evidence of randomness as well as for evidence relating to the
above outlined consequences of randomness.

FIGURE 1
The percentage of inter-response time intervals
greater than t, where t is time in seconds. The data
are from an experiment with white rats in a bar-press
situation as described in the text. The line drawn
through the data is a plot of equation (1).
The data in figure 1 were taken from measurements obtained during the
course of periodic reconditioning. The data represent the responses of a
single animal during a 20-minute session of “three-minute” periodic reconditioning.
Within this observation period the rate of responding was
approximately constant. The question at issue is whether the responses in
this interval are distributed randomly. Equation (1) states that the probability
of getting an interval between responses greater than $t$ is $e^{-rt}$,
where $r$ is the rate of responding expressed in the same units as $t$. In the
20-minute session, 238 responses were made, 237 time intervals were recorded,
and the rate in this session is 0.20 response per second. Thus,
without direct reference to the distribution of time intervals, theory specifies
the distribution of time intervals between responses uniquely. In this
case the probability of getting a time interval greater than $t$ (in seconds) is
$e^{-0.20t}$. The ordinate of figure 1 shows the percentage of the intervals between
responses that were greater than the various time values specified on
the abscissa. The solid line through the data in figure 1 represents the
theoretical function. The data are consistent with the assumption that
the measured responses occurred randomly in time.
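The comparison plotted in figure 1 can be sketched as follows. The function names are hypothetical, and the list of intervals passed to `percent_greater` stands for whatever inter-response times were recorded; none of the original measurements are reproduced here. The rate is computed from the response count and session length alone, as in the text.

```python
import math


def rate_from_counts(n_responses, session_seconds):
    """Rate of responding in responses per second."""
    return n_responses / session_seconds


def percent_greater(intervals, t):
    """Observed percentage of intervals strictly greater than t."""
    return 100.0 * sum(1 for x in intervals if x > t) / len(intervals)


def theoretical_percent(r, t):
    """Equation (1) expressed as a percentage."""
    return 100.0 * math.exp(-r * t)


# 238 responses in a 20-minute session, as reported in the text.
r = rate_from_counts(238, 20 * 60)
```

Plotting `percent_greater` for the recorded intervals against `theoretical_percent(r, t)` over a range of `t` gives the kind of check shown in the figure.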

Although the data of figure 1 may be representative of the agreement between
data and theory under the conditions specified, certain cases of systematic
deviations from theory may be noted. One class of deviations,
for example, may be found in cases where animals show marked “holding”
behavior, i.e., where the bar is depressed and held down for many seconds.
Although the “holding” period is not a “refractory” period³ in the usual
sense of the term, it obviously affects the data in a similar way. During
the “holding” period, the probability of response occurrence is zero. One
complicating feature in analyzing responses characterized by “holding” is
the fact that “holding” is of variable length. The data available at present
do not warrant an extensive treatment of this problem, but the simplicity
that may result from apparatus changes designed to eliminate the factor of
“holding” and the advantages that may accrue from the additional response
specification may be shown.

An example of a distribution showing systematic deviations from theory is
shown in figure 2. The computations and plot are similar to those in
figure 1. The ordinate represents the percentage of intervals between
responses greater than the specified abscissa values. The solid line is
theoretical. The constant of the line was determined, as in the case in
figure 1, directly from the rate of responding without reference to the distribution
of time intervals. The fit is obviously poor; the function appears
sigmoid and asymmetric.

FIGURE 2
A plot similar to figure 1 showing the deviation from theory in cases of “holding” behavior.
The line drawn through the data is a plot of equation (1).

FIGURE 3
The data of figure 2 “corrected for holding.” The plot is similar
to that in figures 1 and 2, except that the measured interval is the
time between the end of one response and the beginning of the next
response. The line drawn through the data is a theoretical one
described in the text.

Let us assume that the analysis leading to equation (1) and applied to
figure 1 is correct when applied to all portions of the observation period
except the time spent in “holding.” An additional test may then be
applied to the data from which figure 2 was obtained. Now we are interested
in the measurement of the time interval between the end of one response
and the beginning of the next.⁵ Figure 3 shows the results of such
measurements in the form of a plot of the percentage of intervals between
the end of one response and the beginning of the next that were greater than
the specified abscissa values. The solid line through the data is theoretical
when the rate term, $r$, is set equal to the ratio of the number of responses to
the total time minus the “holding” time, i.e., to the number of responses
per unit of “available” time. As in figures 1 and 2 the constant is evaluated
independently of the shape of the distribution of intervals.

Data relevant to the present analysis of latency measures are not numerous.
The agreement between the present theory and the data reported
by Felsinger, Gladstone, Yamaguchi and Hull⁶ is shown in figure 4, where
the percentage of latencies greater than specified abscissa values are plotted.
The solid line is the theoretical curve. In the case of the latency data under
consideration it is not possible to evaluate $r$ independently of the distribution
of time intervals. In the case of figure 4 the constant was determined
by the slope of a straight line fitted to a plot of $\log_e P_{>t}$ against $t$.

FIGURE 4 (ordinate: percent of latencies greater than t; abscissa: t, time in seconds)
The percentage of latencies greater than t. The data are from figure 1 of Felsinger,
Gladstone, Yamaguchi and Hull.⁶
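The slope-fitting step used for figure 4 can be illustrated with an ordinary least squares line. The text does not say how the line was fitted, so the use of least squares with a free intercept is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def rate_from_log_survival(ts, ps):
    """Estimate r as minus the least squares slope of log_e(P) against t.

    ts are time values and ps the corresponding proportions of latencies
    greater than t; every proportion must be greater than zero.
    """
    ys = [math.log(p) for p in ps]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    return -sxy / sxx
```

When the proportions follow equation (1) exactly, the fitted slope is $-r$ and the function returns the generating rate.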

Probably little is to be gained at this time by further sampling of the
consequences of equation (1), but many additional tests of the formulation
may be made. For some tests appropriate data are not available. For the
tests that have been tried the agreement between data and theory is
promising. One prediction that has been tested concerns the distribution
of time intervals between responses for a number of animals at comparable
stages in extinction. The expectation is that at a specified stage in extinction
the intervals between, say, responses $R_n$ and $R_{n+1}$ for a large number of
animals, will be distributed in a manner similar to that shown in figure 1
and that the constant, $r$ (therefore the steepness of drop of the curve) will
vary systematically with $n$. In other words, the steepness of the drop of a
curve such as found in figure 1 will depend on where in extinction the intervals
are measured. In fact this expectation seems to be borne out by the
cases measured, although the number of measurements at each stage of
extinction is not large.
Finally, it may be pointed out that the form of the present account has
important consequences for the treatment of experimental data. Since
one of the features of the account is the possibility of specifying the frequency
distributions of the measures discussed it is possible to eliminate
many of the problems associated with the arbitrary selection of representative
values in summarizing data. On the basis of the preceding equations,
one may state changes in one statistic, say the arithmetic mean, in terms of
changes in another, say the geometric mean or the median. Therefore,
data using different statistics are made comparable and the multiplicity of
functions that may arise from the use of different descriptive statistics not
only ceases to pose a difficult problem but is actually an aid to theory testing.
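To make the comparability of statistics concrete, the sketch below lists three summary values of the interval distribution implied by equation (1). These are the standard results for an exponential distribution rather than formulas stated in the paper; the function name is hypothetical.

```python
import math

# Euler-Mascheroni constant, which enters the geometric mean.
EULER_GAMMA = 0.5772156649015329


def exponential_summaries(r):
    """Summary statistics of intervals distributed as in equation (1)."""
    return {
        "arithmetic_mean": 1.0 / r,
        "median": math.log(2) / r,
        "geometric_mean": math.exp(-EULER_GAMMA) / r,
    }
```

All three values are proportional to $1/r$, so under the assumptions of equation (1) a change reported in one statistic can be restated in either of the others by a constant factor.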
Summary. A theoretical account of some relationships among measures
of strength of conditioning has been considered. (1) If we assume that
responses in a “free-response” situation are randomly distributed in time,
we obtain directly a statement of the probability of occurrence of a response
(or of any number of responses) within a specified time interval as a
function of the length of the interval and of the rate of responding; we also
obtain a statement of the probability of occurrence of inter-response time
intervals of varying lengths. (2) If we assume that, for any specified
stimulus condition, there corresponds some rate of responding, it turns out
that the probability of occurrence of latencies of various lengths may be
specified for various rates of responding, or, for a fixed probability value,
the relation between latency and rate may be specified. (3) Finally,
where these considerations are added to a theory specifying the relationship
between rate of responding and number of responses yet to occur, the
number of responses in extinction may be related to the latency and probability
terms as well as to rate. In addition to statements about average
values, the present formulation has consequences for the distribution of
time intervals between responses and, by extension, for the distribution of
latency measures.


1 Hull, C. L., Principles of Behavior, D. Appleton-Century Co., New York, 1943.
2 Skinner, B. F., The Behavior of Organisms, D. Appleton-Century Co., New York,
1938.
3 A slightly different equation results if we assume that a “refractory” period exists,
i.e., that immediately after a response there is a period during which the probability of
getting a response is zero. If we assume that the transition from the “refractory”
period to randomness is instantaneous, the probability of getting an interval greater than
$t$ is

\[ P_{>t} = e^{-r(t - t_0)}, \]

where $t_0$ is the “refractory” period. The formulation is more complex if the transition is
treated as a gradual one or if the “refractory” period has a variable length.

4 The data reported here were recorded by Mr. Michael Kaplan in the Psychological
Laboratories of Columbia University. 

5 This is merely a first approximation. Subsequent analyses may show that the
interval between the end of one response and the beginning of the next is not independent
of the “holding” period. The results of our procedure indicate that the approximation is
useful for the present.

6 The experiment by Felsinger, Gladstone, Yamaguchi and Hull [J. Exptl. Psychol.,
37, 214-228 (1947)] may not provide an optimal test of our formulation for two reasons.
The first is that the data are reported in a frequency distribution with step intervals 
which begin at zero. If the shortest latency were greater than zero, starting the step 
intervals at the lowest measure would be more appropriate. The use of zero as a lower 
limit could easily make an exponential distribution more normal. The method of sum- 
marizing the data may account for the deviation of the point at 0.5 second in figure 4. 
The deviation of this point is an expression of the fact that the distribution reported by 
Felsinger, Gladstone, Yamaguchi and Hull does not have a maximum frequency at the 
first step interval. 

In the second place, it may be’ 
associated with the exposure of 
tinuous ones associated with the presence of the bar. 


cedure of the sort used by Skinner (op. cit.), 
others. After a period of, say, no li 
sponse occurs (Skinner) or Stays on for some 


esponding. The procedure used by 
age of permitting the measurement of the time interval 


between the onset of the stimulus and the first response and the subsequent intervals 
between responses under “the same" stimulus conditions. 


THE THEORY OF SIGNAL DETECTABILITY* 


W. W. PETERSON, T. G. BIRDSALL, AND W. C. Fox 


UNIVERSITY OF MICHIGAN 
ANN ARBOR, MICHIGAN 


The problem of signal detectability treated in this paper is the following:
Suppose an observer is given a voltage varying with time during a prescribed observation
interval and is asked to decide whether its source is noise or is signal plus noise.
What method should the observer use to make this decision, and what receiver is a
realization of that method? After giving a discussion of theoretical aspects of this problem,
the paper presents specific derivations of the optimum receiver for a number of
cases of practical interest.

The receiver whose output is the value of the likelihood ratio of the input voltage
over the observation interval is the answer to the second question no matter which
of the various optimum methods current in the literature is employed including the
Neyman-Pearson observer, Siegert's ideal observer, and Woodward and Davies' “observer.”
An optimum observer required to give a yes or no answer simply chooses an operating
level and concludes that the receiver input arose from signal plus noise only when this
level is exceeded by the output of his likelihood ratio receiver.

Associated with each such operating level are the conditional probability that the
answer is a false alarm and the conditional probability of detection. Graphs of these
quantities, called receiver operating characteristic, or ROC, curves are convenient for
evaluating a receiver. If the detection problem is changed by varying, for example, the
signal power, then a family of ROC curves is generated. Such things as betting curves
can easily be obtained from such a family. The operating level to be used in a particular
situation must be chosen by the observer. His choice will depend on such factors
as the permissible false alarm rate, a priori probabilities, and relative importance of
errors.

With these theoretical aspects serving as an introduction, attention is devoted
to the derivation of explicit formulas for likelihood ratio, and for probability of detection
and probability of false alarm, for a number of particular cases. Stationary, band-limited,
white Gaussian noise is assumed. The seven special cases which are presented
were chosen from the simplest problems in signal detection which closely represent
practical situations.

Two of the cases form a basis for the best available approximation to the important
problem of finding probability of detection when the starting time of the signal,
signal frequency, or both, are unknown. Furthermore, in these two cases uncertainty in
the signal can be varied, and a quantitative relationship between uncertainty and
ability to detect signals is presented for these two rather general cases. The variety of
examples presented should serve to suggest methods for attacking other simple signal
detection problems and to give insight into problems too complicated to allow a direct
solution.
1. Introduction 
The problem of signal detectability treated in this paper is that of determining a
set of optimum instructions to be issued to an “observer” who is given a voltage
varying with time during a prescribed observation interval and who must judge whether
its source is “noise” or “signal plus noise.” The nature of the “noise” and of the
“signal plus noise” must be known to some extent by the observer.


From Trans. IRE Professional


Reprinted with permission.

Any equipment which the observer uses to make this judgment is called the
“receiver.” Therefore the voltage with which the observer is presented is called the
“receiver input.” The optimum instructions may consist primarily in specifying
the “receiver” to be used by the observer.

The first three sections of this article survey the applications of statistical methods
to this problem of signal detectability. They are intended to serve as an introduction
to the subject for those who possess a minimum of mathematical training. Several
definitions of “optimum instructions” have been proposed by other authors. Emphasis
is placed here on the fact that these various definitions lead to essentially the same
receiver. In subsequent sections the actual specification of the optimum receiver is
carried out and its performance is evaluated numerically for some cases of practical
interest [17].


1.1 Population SN and N 


Either noise alone or the signal plus noise may be capable of producing many
different receiver inputs. The totality of all possible receiver inputs when noise alone
is present is called “Population N”; similarly, the collection of all receiver inputs when
signal plus noise is present is called “Population SN.” The observer is presented with a
receiver input from one of the two populations, but he does not know from which
population it came; indeed, he may not even know the probability that it arose from a
particular population. The observer must judge from which population the receiver
input came.


1.2 Sampling plans 


A sampling plan is a system of making a sequence of measurements on the 
receiver input during the observation interval in such a way that it is possible to re- 
construct the receiver input for the observation interval from the measurements. 
Mathematically, a sampling plan is a way of representing functions of time as sequences 
of numbers. The simplest way to describe this idea is to list a few examples. 

A: Fourier series on an interval. Suppose that the observation interval begins
at time $t_0$ and is $T$ seconds long, and that each function in the populations SN and N can
be expanded in a Fourier series on the observation interval. The Fourier coefficients
for each particular receiver input can be obtained by making measurements on that
input, which can in turn be reconstructed from these measurements by the formula

\[ x(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos \frac{2\pi n t}{T} + b_n \sin \frac{2\pi n t}{T} \right), \qquad t_0 < t < t_0 + T. \tag{1} \]

Thus the process of representing each function $x(t)$ by the sequence of its Fourier
coefficients $(a_0, a_1, \ldots, a_n, \ldots, b_1, \ldots, b_n, \ldots)$ is a sampling plan in the sense described above.

The pair of terms in the Fourier series which involve the cosine and sine of
$2\pi n t/T$ is of frequency $n/T$ cycles per second. Suppose that for a particular population
of receiver inputs the terms of frequency greater than $n_0/T$ are zero; i.e., the population
is bandlimited in the Fourier series sense or simply “series-bandlimited.” For such a
population the process of representing each receiver input $x(t)$ by the finite sequence
$(a_0, a_1, \ldots, a_{n_0}, b_1, \ldots, b_{n_0})$ is a finite sampling plan.*

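The finite sampling plan of example A can be sketched numerically. The sketch below is illustrative only: the function names are hypothetical, the coefficients are obtained by an equally spaced sum over the observation interval (which is exact for a series-bandlimited input when the number of points `m` exceeds twice the highest harmonic `n0`), and the constant term is written as $a_0/2$.

```python
import math


def fourier_sample(x, t0, T, n0, m=256):
    """Fourier coefficients of x on (t0, t0 + T) up to harmonic n0.

    Returns lists a and b of length n0 + 1; b[0] is always zero.
    """
    ts = [t0 + T * k / m for k in range(m)]
    a = [2.0 / m * sum(x(t) * math.cos(2 * math.pi * n * t / T) for t in ts)
         for n in range(n0 + 1)]
    b = [2.0 / m * sum(x(t) * math.sin(2 * math.pi * n * t / T) for t in ts)
         for n in range(n0 + 1)]
    return a, b


def reconstruct(a, b, T, t):
    """Evaluate the finite Fourier series at time t from its coefficients."""
    total = a[0] / 2.0
    for n in range(1, len(a)):
        w = 2 * math.pi * n * t / T
        total += a[n] * math.cos(w) + b[n] * math.sin(w)
    return total
```

For a series-bandlimited input the pair `(a, b)` is the finite sample, and `reconstruct` recovers the input at any time in the observation interval.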
* A sampling plan is finite if there is a finite maximum length for the sequences for all
receiver inputs in the population.

B: Shannon's sampling plan. Suppose that the observation interval includes
all time and that the populations are “transform-bandlimited” to a band from 0
to $W$ cycles per second, i.e., the Fourier transform of every receiver input is zero
for frequencies greater than $W$. A sampling plan for this population is to represent
each function $x(t)$ by its amplitude measured at times spaced $1/2W$ seconds apart,
or: $(\ldots, x(t_0 - n/2W), \ldots, x(t_0 - 1/2W), x(t_0), x(t_0 + 1/2W), \ldots, x(t_0 + n/2W), \ldots)$. In
this case the formula [2] for the reconstruction of the receiver input is

\[ x(t) = \sum_{n=-\infty}^{\infty} x\!\left(t_0 + \frac{n}{2W}\right) \frac{\sin \pi[2W(t - t_0) - n]}{\pi[2W(t - t_0) - n]}. \tag{2} \]

The instants of time $t_0 + n/2W$ are called sampling-times. Each choice of $t_0$ between
0 and $1/2W$ yields a different sampling plan. If the observation interval again includes
all time, but the populations are transform-bandlimited to a frequency band from
$f_0 - W/2$ to $f_0 + W/2$ which does not contain zero frequency, then each receiver input
$x(t)$ can be considered as an amplitude and frequency modulated waveform, $x(t) =
r(t) \cos (2\pi f_0 t + \theta(t))$; $r(t)$ is the amplitude of the envelope and $\theta(t)$ is the instantaneous
phase of the carrier. A sampling plan employing sampling-times is obtained in this
case by representing each receiver input by the sequence $(\ldots, r(t_0), \theta(t_0), \ldots, r(t_0 + n/W),
\theta(t_0 + n/W), \ldots)$ of envelope amplitudes and carrier phases measured at sampling-times
spaced by $1/W$ seconds apart [1]. The reconstruction of the receiver input from
this sequence is given by

\[ x(t) = \sum_{n=-\infty}^{\infty} r\!\left(t_0 + \frac{n}{W}\right) \cos\!\left[2\pi f_0 t + \theta\!\left(t_0 + \frac{n}{W}\right)\right] \frac{\sin \pi[W(t - t_0) - n]}{\pi[W(t - t_0) - n]}. \tag{3} \]

C: Sampling plan using sampling-times for a finite observation interval. Only
[...] Fourier transforms, and therefore the hypothesis
that the populations are transform-bandlimited applies only when the observation
interval includes all time. If the observation interval is of finite length and if the
populations are series-bandlimited, then there are sampling plans employing sampling-times
which are similar to those described in paragraph B for transform-bandlimited
populations and an infinite observation interval. Suppose that time is measured from
the beginning of the observation interval, which is $T$ seconds long, and suppose that the
populations are series-bandlimited from 0 to $W$ cycles per second. A sampling
plan for this situation can be obtained by representing each receiver input by the
sequence of its amplitudes measured $1/2W$ seconds apart [...]

[...] populations are series-bandlimited on this interval to a frequency band from $f_0 - W/2$
to $f_0 + W/2$ which does not include zero frequency, then each receiver input can be
represented by a finite sequence $[r(t_0), \theta(t_0), r(t_0 + 1/W), \theta(t_0 + 1/W), \ldots, r(t_0 + T -
1/W), \theta(t_0 + T - 1/W)]$ of envelope amplitudes and carrier phases measured at
points $1/W$ seconds apart; $t_0$ is again used to denote the initial sampling time which
may be chosen anywhere from 0 to $1/W$. The reconstruction of the receiver input from
this sequence of measurements is given by

\[ x(t) = \sum_{n=0}^{WT-1} r\!\left(t_0 + \frac{n}{W}\right) \cos\!\left[2\pi f_0 t + \theta\!\left(t_0 + \frac{n}{W}\right)\right] \frac{\sin \pi[W(t - t_0) - n]}{WT \sin \left\{ \dfrac{\pi}{WT} [W(t - t_0) - n] \right\}}, \qquad 0 \le t \le T. \tag{6} \]

From these examples it can be seen that there are a number of important differences
between various sampling plans such as (a) the length of the observation
interval, (b) whether sampling-times are employed, and (c) whether the measurements
are all to be of the same kind, e.g., instantaneous amplitude measurements, or of different
kinds, e.g., envelope amplitude and carrier phase. However, they all have in
common the property that the receiver input can be reconstructed from the measurements
made on it.

The role which the sampling plan plays in the theory presented in this paper is
primarily one of mathematical convenience. The populations N and SN will be
represented as sequences through the use of sampling plans in order to apply statistical
methods. Once an answer is obtained concerning an “optimum receiver,” it is often
possible to translate this answer back to the more familiar language of receiver inputs.
If a finite-sampling plan is not available for a particular application of the theory, then
recent work by Grenander [3] shows that the desired parameters of the “optimum
receiver” can be approximated by using finite-sampling plans. Both for this reason and
in order to simplify the exposition, the theory presented here is restricted to cases where
finite-sampling plans are available.


2. Optimum Tests on Fixed Observation Intervals
2.1 Probability density functions

This part of the paper is concerned with a method of statistical analysis which
requires for raw data a finite sequence of numbers (x_1, x_2, ..., x_n), which is the result
of the measurements made at the receiver input according to some particular finite-
sampling plan. The sequence is often called a “sample” of the population from which
it arose, and is denoted by a single letter: thus, if the receiver input is x(t), and the
sampling plan yields a sequence (x_1, x_2, ..., x_n), then this sequence is called the sample
X. The theory to be developed here is intended to specify an optimum receiver and is
couched in the language of samples, X = (x_1, x_2, ..., x_n). If n is very large, a receiver
which had to make the measurements called for by a sampling plan would certainly
be impractical. However, this practical difficulty is avoided when the specification of
the receiver is translated back from the language of samples to the language of the


receiver inputs; this can be done because it is possible to reconstruct the inputs from
the samples.

For the purposes of the subsequent development any finite sampling plan may
be considered, provided enough properties are known of the associated sample X so
that certain probabilities may be calculated. Specifically, the probability density
functions f_N(X) and f_SN(X) of the sample variable X for the cases when X is drawn from
populations N and SN, respectively, must be known.* The two basic properties of
density functions are

    f_N(X) \ge 0, \qquad \int f_N(X)\,dX = 1,

and                                                                  (7)

    f_{SN}(X) \ge 0, \qquad \int f_{SN}(X)\,dX = 1,

where the integration symbol represents the multiple integral taken over the entire
range of the sample variable X = (x_1, x_2, ..., x_n).


2.2 The concept of a criterion

Consider now an observer who has as available data the sample X = (x_1, x_2, ..., x_n).
The observer's job is to judge for each sample whether or not it was taken from popula-
tion SN. Although it is not possible to determine the (probably subconscious) criterion
used by the observer, it is quite possible to find an external manifestation of it. Ideally
all that is necessary is to submit each possible sample to the observer and to record his
judgment. This will yield a tabulation of those samples which the observer decided
were drawn from population SN. If any other observer is given this tabulation and
instructed to base his decisions on it, he will behave exactly as did the first observer.
Thus, the tabulation of these responses can be used to replace the mental criterion
employed by the observer. Such a tabulation will also be called a criterion and will be
denoted by the letter A, which refers to the phraseology common in statistics of
“accepting the hypothesis that a signal is present.” The tabulation of the remaining
samples, those which the observer concluded were drawn from population N, will be
denoted by B.


2.3 Probabilities associated with criteria

There are, of course, as many different criteria as there are observers. Among
all possible criteria it is necessary to select those that are best for various purposes. To
do so, certain numerical quantities must be associated with each criterion. It will be
necessary to know the probability that a sample from one of the populations will be
listed in a particular criterion A. According to the standard definitions, these
probabilities are given by

    P_{SN}(A) = \int_A f_{SN}(X)\,dX

and                                                                  (8)

    P_N(A) = \int_A f_N(X)\,dX,


where the multiple integral is taken over all samples listed in the criterion A.

* In this discussion it should be kept in mind that “the event of the sample being drawn
from population SN” corresponds to signal and noise being present at the receiver input.
Also “the event of population SN being sampled” means the same thing.

For example, a particular sample plan might have a density function of the form
f_N(x_1, x_2, ..., x_n) = K exp[−(x_1² + x_2² + ... + x_n²)]. A possible criterion would
consist of those samples X = (x_1, x_2, ..., x_n) which lie outside a sphere of radius 1
centered at the origin. Then the integral would be taken over the exterior of this sphere.
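
As a rough numerical check on this example, the sketch below estimates P_N(A) by simulation for the sphere criterion. It assumes n = 2 measurements, in which case the density K exp[−(x_1² + x_2²)] makes each measurement an independent Gaussian with mean 0 and variance 1/2, and the exterior of the unit circle has probability e^(−1). The function name, the number of trials, and the seed are illustrative choices and are not taken from the paper.

```python
import math
import random

def false_alarm_probability(n, trials=200_000, seed=0):
    """Estimate P_N(A) for A = {X : x_1^2 + ... + x_n^2 > 1}.

    Under f_N = K exp[-(x_1^2 + ... + x_n^2)] the coordinates are
    independent Gaussians with mean 0 and variance 1/2.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)
    outside = 0
    for _ in range(trials):
        r2 = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))
        if r2 > 1.0:          # sample lies outside the unit sphere
            outside += 1
    return outside / trials

estimate = false_alarm_probability(n=2)
exact = math.exp(-1.0)        # closed form, valid for n = 2 only
```

For other values of n the same function can be used, but the closed form above no longer applies.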

These probabilities have a special significance. P_N(A) is the conditional prob-
ability that a sample from population N will be listed in criterion A; that is, will be
judged as a sample from population SN. Thus P_N(A) = F is the conditional false
alarm probability. Also, P_SN(A) is the conditional probability of a certain kind of
correct response called a hit (that of judging correctly that a sample is from population
SN). The conditional probability of judging falsely that a sample from population SN
was drawn from population N is, therefore, given by 1 − P_SN(A) = M, the conditional
probability of a miss. The only errors which can occur are false alarms and misses;
their conditional probabilities, F and M, are called briefly the error probabilities.

A reader familiar with the formal content of probability theory should note that
these quantities are true conditional probabilities; the first is conditional on the sample
being drawn from population SN; the second is conditional on its being drawn from
population N. This is to distinguish them from a priori probabilities (the probabilities
that a certain population will be sampled, for example) which are not as yet assumed
known.


2.4 Likelihood ratio and the ratio criteria

It is convenient to introduce a new function called the likelihood ratio, ℓ(X),
defined as the ratio f_SN(X)/f_N(X) for sample points X = (x_1, x_2, ..., x_n). ℓ(X) represents
the likelihood that the sample X was drawn from SN relative to the likelihood that it
was drawn from N. Hence, if ℓ(X) is sufficiently large, it would be reasonable to conclude
that X was in fact drawn from population SN, i.e., that X should be listed in the desired
“best” criterion. Thus, for each number β > 0, a certain criterion A(β) will be selected;
A(β) is chosen by listing each sample X for which ℓ(X) ≥ β. The problem then reduces
to that of making a wise choice of β; that is, to determine how large “sufficiently large”
is. Criteria of the form A(β) will be called ratio criteria.

A number of writers have presented varying definitions of a criterion being
“optimum.” It turns out that each of these optimum criteria can be expressed as a ratio
criterion, so that a receiver designed to yield likelihood ratio as output could be used
with any of them.


2.5 Weighted combination criteria

Suppose it is possible to assign a certain number w as a weighting factor rep-
resenting the importance of a false alarm relative to a hit. Since P_SN(A) is the prob-
ability of a hit, and P_N(A) the probability of a false alarm, it would then be reasonable
to find a criterion A which maximizes the quantity

    P_{SN}(A) - w P_N(A).                                            (9)

But this quantity can be written as

    \int_A [f_{SN}(X) - w f_N(X)]\,dX,                               (10)

where the integration is taken over the sample points X listed in A. To maximize this
integral, one would list in A every sample for which the integrand was not negative.
Solving that inequality for w, one sees that A should contain those sample points X for
which

    \frac{f_{SN}(X)}{f_N(X)} \ge w.                                  (11)

Thus the desired criterion A is simply A(w), and so it is a ratio criterion.
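
The argument can be checked by brute force on a small discrete sample space, where the integrals become sums. The following sketch assumes five hypothetical sample points with made-up probabilities under N and SN and a weighting factor w = 1.5; none of these numbers come from the paper. It examines every possible criterion (every subset of the sample space) and compares the maximizer of the weighted combination with the ratio criterion A(w).

```python
from itertools import combinations

# Hypothetical probabilities of five sample points (each list sums to 1).
f_n = [0.40, 0.30, 0.15, 0.10, 0.05]    # under population N
f_sn = [0.05, 0.15, 0.20, 0.25, 0.35]   # under population SN
w = 1.5                                 # assumed weighting factor

def weighted_combination(criterion):
    """P_SN(A) - w * P_N(A) for a criterion given as a set of point indices."""
    p_sn = sum(f_sn[i] for i in criterion)
    p_n = sum(f_n[i] for i in criterion)
    return p_sn - w * p_n

points = range(len(f_n))
all_criteria = [set(c) for r in range(len(f_n) + 1)
                for c in combinations(points, r)]

# Best criterion by exhaustive search over all 2^5 subsets.
best = max(all_criteria, key=weighted_combination)

# Ratio criterion A(w): every point whose likelihood ratio is at least w.
ratio_criterion = {i for i in points if f_sn[i] / f_n[i] >= w}
```

With these numbers no point has an integrand of exactly zero, so the maximizing criterion is unique; when ties occur, several criteria attain the same maximum.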
2.6 Neyman-Pearson criteria 

If it is critically important to keep the probability of a false alarm P_N(A) below
a certain level k, then it would be reasonable to choose from among such criteria that
one which maximizes the probability of a hit. Thus Neyman and Pearson proposed
[4] as a type of optimum criterion any criterion A_k for which

(1) P_N(A_k) ≤ k, and

(2) P_SN(A_k) is a maximum for all the criteria A with the property P_N(A) ≤ k.

The A_k type criterion can also be expressed as a ratio criterion. This can be
made plausible as follows. To begin with, it is necessary to consider only those criteria
A for which P_N(A) = k, because A will be taken as large as possible in order to meet
condition (2). Now consider the curve given parametrically by the equations

    X = X(\beta) = P_N[A(\beta)]

and                                                                  (12)

    Y = Y(\beta) = P_{SN}[A(\beta)].

This curve will be called the Receiver Operating Characteristic (briefly, ROC) curve,
for a receiver whose output is likelihood ratio and with which ratio criteria are being
used.


The ROC curve passes through the points (0, 0) and (1, 1), the first at β = ∞,
the second at β = 0. At β = 0, ℓ(X) ≥ β = 0 for all X, so A(0) consists of all possible
samples. Thus the observer will report that every sample is drawn from SN, so he will
be certain to make a false alarm and to make a hit. (This assumes that the samples will
not be drawn exclusively from one of the populations.) This can be verified, using the
basic property of the density functions expressed by the following equations:

    P_{SN}[A(0)] = \int f_{SN}(X)\,dX = 1

and                                                                  (13)

    P_N[A(0)] = \int f_N(X)\,dX = 1,

where the integration is taken over all possible samples X. These equations mean that
X(0) = Y(0) = 1. Moreover X(∞) = Y(∞) = 0, because for β = ∞ there are no
samples X with ℓ(X) ≥ ∞; i.e., A(∞) contains no samples at all and the operator
will never report a signal is present. Therefore, the operator cannot possibly make a
false alarm nor can he make a hit. Thus P_SN[A(∞)] = 0 and P_N[A(∞)] = 0.
These considerations, together with those of the next section, show that the
ROC curve can be sketched somewhat as in Fig. 1.

FIGURE 1
Typical ROC curve. Abscissa: X(β) = P_N[A(β)]; ordinate: Y(β) = P_SN[A(β)].


To determine the desired A_k, recall that all probabilities lie between zero and
one, so that P_N(A_k) = k is between zero and one. Then there is a point Q of the ROC
curve which lies vertically above the point (k, 0). The coordinates (X, Y) of Q are
X = P_N[A(β)] = k and Y = P_SN[A(β)], for some β, which will be written β_k. Now
A(β_k) satisfies condition (1) because P_N[A(β_k)] = k, and therefore A(β_k) will be the
desired A_k if P_SN(A) ≤ P_SN[A(β_k)] for any criterion with the property that P_N(A) = k.
From paragraph 2.5, it is clear that the ratio criterion A(β_k) is an optimum weighted-
combination criterion with the weighting factor w = β_k. Therefore, if w = β_k, the
weighted combination using the criterion A(β_k) is greater than or equal to the same
weighted combination using any other criterion A, i.e.,

    P_{SN}[A(\beta_k)] - \beta_k P_N[A(\beta_k)] \ge P_{SN}(A) - \beta_k P_N(A).     (14)

In this case both P_N[A(β_k)] and P_N(A) are equal to k. If this value is substituted into
the inequality above, one obtains

    P_{SN}[A(\beta_k)] \ge P_{SN}(A).                                (15)

Therefore, the desired Neyman-Pearson criterion A_k should be chosen to be this partic-
ular ratio criterion, A(β_k).

2.7 ROC curve


It is desirable to digress for a moment to study the ROC curve more closely.
Its value lies in the fact that if the type of criterion chosen for a particular application
is a ratio criterion, A(β), then a complete description of the detection system's perform-
ance can be read off the ROC curve. By the very definition of the ROC curve, the X
coordinate is the conditional probability F of false alarm, and the Y coordinate is the
conditional probability of a hit. Similarly (1 − X) is the conditional probability of
being correct when noise alone is present, and (1 − Y) = M is the conditional prob-
ability of a miss. It will be shown in a moment that the operating level β for the ratio
criterion A(β) can also be determined from the ROC curve as the slope at the point
{P_N[A(β)], P_SN[A(β)]}.

Since most proposed kinds of optimum criteria can be reduced to ratio criteria, the
ROC curve assumes considerable importance.
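
How an operating point is read off the curve, and how the Neyman-Pearson point of paragraph 2.6 is located, can be illustrated with a minimal sketch. It assumes a single measurement x that is Gaussian with unit variance, mean 0 under N and mean D under SN; this model, the value D = 1.5, the level k = 0.05, and all function names are assumptions made for illustration only. For this model ℓ(x) = exp(Dx − D²/2), so a ratio criterion is a one-sided cut on x. The sketch also computes the hit probability of a two-sided criterion with the same false alarm probability, as one example of a competing criterion A in inequality (15).

```python
import math
from statistics import NormalDist

D = 1.5                                   # assumed mean shift under SN
NOISE = NormalDist(0.0, 1.0)              # population N (assumed Gaussian)
SIGNAL_PLUS_NOISE = NormalDist(D, 1.0)    # population SN (assumed Gaussian)

def ratio_criterion_point(k):
    """Return (beta_k, X, Y) for the ratio criterion with P_N[A(beta_k)] = k."""
    cut = NOISE.inv_cdf(1.0 - k)              # l(x) >= beta_k  <=>  x >= cut
    beta_k = math.exp(D * cut - D * D / 2.0)  # likelihood ratio at the cut
    x = 1.0 - NOISE.cdf(cut)                  # false alarm probability F
    y = 1.0 - SIGNAL_PLUS_NOISE.cdf(cut)      # hit probability 1 - M
    return beta_k, x, y

def two_sided_hit_probability(k):
    """Hit probability of the criterion |x| >= c chosen so that P_N = k."""
    c = NOISE.inv_cdf(1.0 - k / 2.0)
    return (1.0 - SIGNAL_PLUS_NOISE.cdf(c)) + SIGNAL_PLUS_NOISE.cdf(-c)

beta_k, false_alarm, hit = ratio_criterion_point(0.05)
competitor_hit = two_sided_hit_probability(0.05)
```

Under these assumptions the one-sided ratio criterion attains the larger hit probability, in agreement with (15).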
In order to determine some of its geometric properties, it will be assumed that
the parametric functions

    X = X(\beta) = P_N[A(\beta)]

and                                                                  (16)

    Y = Y(\beta) = P_{SN}[A(\beta)]

are differentiable functions of β. The slope of the tangent to the ROC curve is given by
the quotient (dY/dβ)/(dX/dβ). To calculate the slope at the point [X(β_0), Y(β_0)], notice
that among all criteria A, the quantity P_SN(A) − β_0 P_N(A) is maximized by A = A(β_0).
Therefore, in particular, the function

    Y(\beta) - \beta_0 X(\beta) = P_{SN}[A(\beta)] - \beta_0 P_N[A(\beta)]           (17)

has a maximum at β = β_0, so that its derivative must vanish there. Thus differentiating,

    \left(\frac{dY}{d\beta}\right)_{\beta=\beta_0} - \beta_0 \left(\frac{dX}{d\beta}\right)_{\beta=\beta_0} = 0.     (18)

Solving for β_0, one obtains

    \beta_0 = \frac{(dY/d\beta)_{\beta=\beta_0}}{(dX/d\beta)_{\beta=\beta_0}}
            = the slope of the tangent to the ROC curve at the point [X(\beta_0), Y(\beta_0)].     (19)

This shows that the slope of the ROC curve is given by its parameter β, and so is always
positive. Hence the curve rises steadily. In addition, this means that Y(β) can be written
as a single valued function of X(β), Y = Y(X), which is monotone increasing, and where
Y(0) = 0 and Y(1) = 1. These remarks make fully warranted the sketch of the ROC
curve given in Fig. 1. The next two sections are concerned with determining the best
value to use for the weighting factor w when a priori probabilities are known.
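
Result (19), that the slope of the ROC curve at the point for A(β) equals β, can be checked numerically. The sketch below assumes a single Gaussian measurement with unit variance, mean 0 under N and mean D = 1 under SN, so that ℓ(x) = exp(Dx − D²/2) and A(β) is the set x ≥ ln(β)/D + D/2. The model, the step size, and the function names are illustrative assumptions, not part of the paper.

```python
import math
from statistics import NormalDist

D = 1.0               # assumed mean shift under SN
STD = NormalDist()    # standard normal distribution

def roc_point(beta):
    """Return (X(beta), Y(beta)) for the assumed Gaussian model."""
    cut = math.log(beta) / D + D / 2.0    # l(x) >= beta  <=>  x >= cut
    x = 1.0 - STD.cdf(cut)                # P_N[A(beta)]
    y = 1.0 - STD.cdf(cut - D)            # P_SN[A(beta)]
    return x, y

def roc_slope(beta, h=1e-4):
    """Central-difference estimate of (dY/d beta) / (dX/d beta)."""
    x_lo, y_lo = roc_point(beta - h)
    x_hi, y_hi = roc_point(beta + h)
    return (y_hi - y_lo) / (x_hi - x_lo)

slopes = {beta: roc_slope(beta) for beta in (0.5, 1.0, 2.0, 4.0)}
```

Each estimated slope should agree with its β to within the accuracy of the finite difference.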


2.8 Siegert's “Ideal Observer's” criteria

Here it is necessary to know beforehand the a priori probabilities that popula-
tion SN and that population N will be sampled. This is an additional assumption.
These probabilities are denoted respectively by P(SN) and P(N). Moreover, P(SN) +
P(N) = 1 because at least one of the populations must be sampled. The criterion
associated with Siegert's Ideal Observer is usually defined as a criterion for which the
a priori probability of error is minimized (or, equivalently, the a priori probability of a
correct response is maximized) [5]. Frequently the only case considered is that where
P(SN) and P(N) are equal, but this restriction is not necessary.

Since the conditional probability F of a false alarm is known as well as the a
priori probability of the event (that population N was sampled) upon which F is
conditional, then the probability of a false alarm is given by the product

    P(N)F.                                                           (20)

In the same way the probability of a miss is given by

    P(SN)M.                                                          (21)

Because an error E can occur in exactly these two ways, the probability of error is
the sum of these quantities

    P(E) = P(N)F + P(SN)M.                                           (22)

It has already been pointed out that F = P_N(A) and M = 1 − P_SN(A). If
these are substituted into the expression for P(E) a simple algebraic manipulation gives

    P(E) = P(SN) - P(SN)\left[P_{SN}(A) - \frac{P(N)}{P(SN)} P_N(A)\right].          (23)

It is desired to minimize P(E). But from the last equation this is equivalent to
maximizing the quantity

    P_{SN}(A) - \frac{P(N)}{P(SN)} P_N(A),                           (24)

and, of course, this will yield a weighted combination criterion with w = P(N)/P(SN),
which is known to be simply a ratio criterion A(w).


2.9 Maximum expected-value criteria

Another way to assign a weighting factor w depends on knowing the “expected
value” of each criterion. This can be determined if the a priori probabilities P(SN) and
P(N) are known, and if numerical values can be assigned to the four alternatives. Let
V_D be the value of detection and V_Q the value of being “quiet,” that is, of correctly
deciding that noise alone is present. The other two alternatives are also assigned
values, V_M, the value of a miss, and V_F, the value of a false alarm. The expected value
associated with a criterion can now be determined. In this case it is natural to define an
optimum criterion as one which maximizes the expected value. It can be shown that
such a criterion maximizes

    P_{SN}(A) - \frac{P(N)}{P(SN)} \cdot \frac{V_Q - V_F}{V_D - V_M}\, P_N(A).       (25)

By definition (see paragraph 2.5), this criterion is a weighted combination criterion with
weighting factor

    w = \frac{P(N)}{P(SN)} \cdot \frac{V_Q - V_F}{V_D - V_M},        (26)

and hence a likelihood ratio criterion. Siegert's “Ideal Observer” criterion is the special
case for which V_D − V_M = V_Q − V_F.
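
The weighting factor of (26) can be checked against a direct search for the operating level that maximizes expected value. The sketch below writes the four values as v_d (detection), v_m (miss), v_q (quiet), and v_f (false alarm) and assumes, purely for illustration, a single Gaussian measurement with unit variance and mean shift d under SN; the a priori probability and the values used are made-up numbers, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def weighting_factor(p_sn, v_d, v_m, v_q, v_f):
    """Weighting factor w = [P(N)/P(SN)] * (v_q - v_f) / (v_d - v_m)."""
    p_n = 1.0 - p_sn
    return p_n * (v_q - v_f) / (p_sn * (v_d - v_m))

def expected_value(beta, p_sn, v_d, v_m, v_q, v_f, d=1.0):
    """Expected value of the ratio criterion A(beta) in the assumed model."""
    std = NormalDist()
    cut = math.log(beta) / d + d / 2.0    # l(x) >= beta  <=>  x >= cut
    false_alarm = 1.0 - std.cdf(cut)
    hit = 1.0 - std.cdf(cut - d)
    p_n = 1.0 - p_sn
    return (p_sn * (hit * v_d + (1.0 - hit) * v_m)
            + p_n * (false_alarm * v_f + (1.0 - false_alarm) * v_q))

# Hypothetical a priori probability and values of the four alternatives.
case = dict(p_sn=0.3, v_d=2.0, v_m=-1.0, v_q=1.0, v_f=-2.0)
w = weighting_factor(**case)

# Grid search over operating levels beta from 0.10 to 10.09 in steps of 0.01.
betas = [0.1 + 0.01 * i for i in range(1000)]
best_beta = max(betas, key=lambda b: expected_value(b, **case))
```

When v_d − v_m equals v_q − v_f, the factor reduces to P(N)/P(SN), the Ideal Observer weighting of paragraph 2.8.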

2.10 A posteriori probability and signal detectability

Heretofore the observer has been limited to two possible answers, “signal plus
noise is present” or “noise alone is present.” Instead he may be asked what, to the
best of his knowledge, is the probability that a signal is present. This approach has the
advantage of getting more information from the receiving equipment. In fact, Wood-
ward and Davies point out that if the observer makes the best possible estimate of this
probability for each possible transmitted message, he is supplying all the information
which his equipment can give him [6]. A good discussion of this approach is found in
the original papers by Woodward and Davies [6, 7]. Their formula for the a posteriori
probability, P_X(SN), becomes, in the notation of this paper,

    P_X(SN) = \frac{f_{SN}(X) P(SN)}{f_{SN}(X) P(SN) + f_N(X) P(N)}                  (27)

or

    P_X(SN) = \frac{\ell(X) P(SN)}{\ell(X) P(SN) + 1 - P(SN)}.                       (28)

If a receiver which has likelihood ratio as its output can be built, and if the a priori
probability P(SN) is known, a posteriori probability can be calculated easily. The
calculation could be built into the receiver calibration, since (28) is a monotonic func-
tion of ℓ(X); this would make the receiver an optimum receiver for obtaining a pos-
teriori probability.
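
The two forms of the a posteriori probability can be compared directly. The following sketch simply transcribes (27) and (28) into two small functions; the function names and the numerical inputs are illustrative assumptions.

```python
def posterior_from_densities(f_sn, f_n, p_sn):
    """Formula (27): a posteriori probability from the two density values."""
    p_n = 1.0 - p_sn
    return f_sn * p_sn / (f_sn * p_sn + f_n * p_n)

def posterior_from_ratio(likelihood_ratio, p_sn):
    """Formula (28): a posteriori probability from the likelihood ratio alone."""
    return likelihood_ratio * p_sn / (likelihood_ratio * p_sn + 1.0 - p_sn)

# Hypothetical density values at one sample X and an assumed P(SN).
f_sn_x, f_n_x, p_sn = 0.20, 0.05, 0.10
via_densities = posterior_from_densities(f_sn_x, f_n_x, p_sn)
via_ratio = posterior_from_ratio(f_sn_x / f_n_x, p_sn)
```

Because the second function increases with the likelihood ratio for any fixed P(SN) between 0 and 1, a receiver calibrated in likelihood ratio can be relabeled in a posteriori probability without changing the order of its outputs.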


3. Sequential Tests with Minimum Average Duration

3.1 Sequential testing

The idea of sequential testing is this: make one measurement x_1 on the receiver
input; if the evidence (x_1) is sufficiently persuading, decide as to whether the receiver
input was drawn from population SN or from population N. If the evidence is not so
strong, make a second measurement x_2 and consider the evidence (x_1, x_2). Continue
to make measurements until the resulting sequence of measurements is sufficiently
persuading in favor of one population or the other. Obviously this involves the
theoretical possibility of making arbitrarily many measurements before a final decision
is made. This does not mean that infinitely many measurements must be made in an
actual application, nor does it necessarily mean that the operation might entail an
arbitrarily long interval of time. If, in a particular application, measurements are taken
at evenly spaced times then the “time base” of such a measurement plan is infinite.
However, another plan might call for measurements to be made at the instants t = 0,
1/2, 2/3, ..., (n − 1)/n, ..., and as these times all lie in the time interval from zero to
one, such a measurement plan would have a time base of only one unit of time.

If the measurement plan has been carried out to the stage where n measurements
have been made, the variable X_n = (x_1, x_2, ..., x_n) is called the nth stage sample
variable. A specific plan for measurements will be considered only if for
each possible stage n, the two density functions f_SN(X_n) and f_N(X_n) of the nth stage
sample variable X_n are known: the first of these density functions is applicable when
population SN is being sampled and the second is applicable when population N is
being sampled. These density functions may very well differ at different stages, so that
they should be written f_N^(n)(X_n) and f_SN^(n)(X_n); however, the n appearing in the
argument X_n should always make the situation clear, and the superscript on the density
functions themselves will be omitted.

3.2 Sequential tests


A sequential test will consist of two things:
(1) An (infinite) measurement plan with density functions f_N(X_n) and f_SN(X_n),
(2) An assignment of three criteria to each stage of the measurement plan.
These three criteria represent the three possible conclusions:
(A) Signal plus noise is present, i.e. the sample comes from population SN,
(B) Noise alone is present, i.e. the sample comes from population N,
(C) Another measurement should be made.

At the first stage of the measurement plan, any (real) number at all could
theoretically result from the first measurement. This means that the first stage sample
variable X_1 = (x_1) ranges through the entire number system, which will be written S_1
to stand for the first stage sample space. Suppose the three first-stage criteria A_1, B_1,
and C_1 have been chosen. If the sample X_1 is listed in A_1, the conclusion that a signal is
present is drawn and the test is terminated. If it is listed in B_1, the conclusion is that
noise alone is present, and again the test is terminated. If X_1 should be listed in C_1,
another measurement will be made, and the test moves on to the second stage instead
of terminating.

When the first stage criteria have been chosen, a limitation is placed on S_2, the
space through which the second stage sample variable X_2 = (x_1, x_2) ranges. The only
way the test can proceed to the second stage is for X_1 = (x_1) to be listed in C_1. There-
fore, S_2 does not contain all possible second stage samples X_2 = (x_1, x_2) but only those
for which (x_1) is listed in C_1. Three second stage criteria, A_2, B_2, and C_2, must now be
chosen from those samples X_2 listed in S_2. They must be chosen in such a way that
there are no duplications in the listings and no sample in S_2 is omitted. These criteria
carry exactly the same significance as those chosen in the first stage. That is, the three
conclusions that a signal is or is not present, or that the test should be continued, are
drawn when the sample X_2 is listed in A_2, B_2, or C_2 respectively.

The selection of criteria proceeds in the same way. If the nth stage criteria
A_n, B_n, and C_n have been chosen, then the next stage's sample space S_{n+1} consists of
those samples X_{n+1} = (x_1, x_2, ..., x_n, x_{n+1}) for which X_n = (x_1, x_2, ..., x_n) was listed
in C_n. Then from S_{n+1} are drawn the three (n + 1) stage criteria A_{n+1}, B_{n+1},
and C_{n+1}.

When an entire sequence

    (A_1, B_1, C_1),
    (A_2, B_2, C_2),
    ...
    (A_n, B_n, C_n),
    ...

of criteria is selected, a “sequential test” has been determined. This does not mean of
course that the test will necessarily be particularly useful. However, among all the
possible ways of selecting a sequence of criteria and hence a sequential test, there
may be particular ones which are very useful.

3.3 Probabilities associated with sequential tests
If Q_n is any nth stage criterion, then the quantities*

    P_N(Q_n) = \int_{Q_n} f_N(X_n)\,dX_n

and                                                                  (29)

    P_{SN}(Q_n) = \int_{Q_n} f_{SN}(X_n)\,dX_n

represent the (N or SN) conditional probabilities that an nth stage sample X_n will be
listed in the criterion Q_n. Conditional probabilities of particular interest are:

(1) The nth stage conditional error probabilities:
• If population N is sampled, then the probability that the sample variable X_n
will be listed in A_n is P_N(A_n). This is the N-conditional probability of a false alarm.
• If population SN is sampled, then the probability that the sample variable X_n
will be listed in B_n is P_SN(B_n). This is the SN-conditional probability of a miss.

(2) The conditional error probabilities of the entire test:

    F = \sum_{n=1}^{\infty} P_N(A_n), the N-conditional probability of a false alarm, and     (30)

    M = \sum_{n=1}^{\infty} P_{SN}(B_n), the SN-conditional probability of a miss,            (31)

are merely the sums of the same error probabilities over all stages.

(3) The conditional probabilities of terminating at stage n are

    T_N^n = P_N(A_n) + P_N(B_n),                                     (32)

and

    T_{SN}^n = P_{SN}(A_n) + P_{SN}(B_n).                            (33)

These equations can be justified by a simple argument. The only way the test can
terminate at stage n is for the sample variable X_n to be listed in either A_n or B_n. The
probability of this event is the sum of the probabilities of the component events which
are mutually exclusive since X_n can be listed in at most one of A_n and B_n.

(4) The conditional probabilities that the entire test will terminate are

    T_N = \sum_{n=1}^{\infty} T_N^n                                  (34)

and

    T_{SN} = \sum_{n=1}^{\infty} T_{SN}^n.                           (35)

* The notation \int_{Q_n} indicates that the integration is to be carried out over all sample
points listed in Q_n.

3.4 Average sample numbers

There are two other quantities which must be introduced. One feature of the
sequential test is that it affords an opportunity of arriving at a decision early in the
sampling process when the data happen to be unusually convincing. Thus one might
expect that, on the average, the stage of termination of a well-constructed sequential
test would be lower than could be achieved by an otherwise equally good standard
test. It is therefore important to obtain expressions for the average or expected value
of the stage of termination. As with other probabilities, there will be two of these
quantities: one conditional on population N being sampled; the other conditional on
population SN being sampled. They are given by

    E_N = \sum_{n=1}^{\infty} n\,T_N^n                               (36)

and

    E_{SN} = \sum_{n=1}^{\infty} n\,T_{SN}^n.                        (37)

The letter E is used to refer to the term “expected value.” The quantities E_N and E_SN
are called the average sample numbers. The form these formulas take can be justified
(somewhat freely) on the grounds that each value, n, which the variable “stage of
termination” may take on must be weighted by the (conditional) probability that the
variable will in fact take on that value.

It should be heavily emphasized that the average sample numbers are strictly
average figures. In actual runs of a sequential test, the stages of termination will some-
times be less than the average sample numbers but will also be, upon occasion, much
larger. Any sequential test whose average sample numbers are not finite would be
useless for applications. Therefore the only ones to be considered are those with finite
average sample numbers. Under this assumption,* it can be shown that T_N = T_SN = 1,
so that the test is certain to terminate (in the sense of probability). On the other hand,
if it is known that T_N = T_SN = 1 it does not always follow that the average sample
numbers are finite. Such a situation would mean only that if a sequence of runs of the
test were made, each run would probably terminate, but the average stage of termina-
tion would become arbitrarily large as more runs were made.


3.5 Sequential ratio tests

In studying non-sequential tests using finite samples it was found that the best
criterion could always be expressed in terms of likelihood ratio. Therefore, it may be
useful to introduce likelihood ratios at each stage of an infinite sample plan. The nth
stage likelihood ratio function ℓ(X_n) is defined as the ratio f_SN(X_n)/f_N(X_n). Optimum
criteria in the finite-sample tests turned out to be criteria listing all samples X for
which ℓ(X) is greater than or equal to a certain number. It should be possible to choose
sequential criteria (A_n, B_n, C_n) in the same way. For each stage two numbers a_n and
b_n with b_n < a_n could be chosen. Then the criteria (A_n, B_n, C_n) determined by the
numbers a_n and b_n would be

A_n lists all samples X_n of the sample space S_n for which ℓ(X_n) ≥ a_n,
B_n lists all samples X_n of the sample space S_n for which ℓ(X_n) ≤ b_n,
C_n lists all samples X_n of the sample space S_n for which b_n < ℓ(X_n) < a_n.

If criteria selected in this way meet the requirement that the average sample numbers
be finite, then the resulting sequential test is called a “sequential ratio test.”

* Remember that the sampling process is not assumed to yield independence among
the x_i.


3.6 Optimum sequential tests

It is customary [8] to define an optimum sequential test as that one for which
the average sample numbers E_N and E_SN are minimum among all sequential tests with
fixed error probabilities F and M.

In addition to the formulas given in Section 3.4, alternative formulas [9] for
the average sample numbers are

    E_N = 1 + \sum_{i=1}^{\infty} P_N(C_i)                           (38)

and

    E_{SN} = 1 + \sum_{i=1}^{\infty} P_{SN}(C_i).                    (39)

Thus, if a set of sequential criteria (A_n*, B_n*, C_n*) is presented as a possible optimum
test, then its optimum character is decided by ascertaining whether the inequalities

    \sum_i P_N(C_i^*) \le \sum_i P_N(C_i)                            (40)

and

    \sum_i P_{SN}(C_i^*) \le \sum_i P_{SN}(C_i)                      (41)

hold for every other set of sequential criteria (A_n, B_n, C_n) with the same error prob-
abilities, i.e., with

    \sum_i P_N(A_i^*) = \sum_i P_N(A_i)                              (42)

and

    \sum_i P_{SN}(B_i^*) = \sum_i P_{SN}(B_i).                       (43)

The problem of constructing an optimum sequential test is difficult because the
equalities (42) and (43) can be satisfied even when there is no apparent term-by-term
relation between the sequences {P_N(C_i*)} and {P_N(C_i)}. Wald has proposed as opti-
mum the tests in which each of the sequences {a_n} and {b_n} is constant, that is, b_n = b_1
and a_n = a_1, for all n. Moreover Wald and Wolfowitz [10] proved that these tests are
optimum whenever the density functions at successive stages are independent, as can
be the case for example when both noise and signal plus noise consist of “random
noise.” However, this “randomness” is not met with in most applications of the theory
of signal detectability, at least not in the sense that the hypotheses of Wald and Wolfo-
witz are satisfied.
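
A small simulation can illustrate a sequential ratio test with constant numbers a and b, together with the two expressions (36) and (38) for the average sample number. The sketch below makes assumptions that go beyond the text: successive measurements are taken to be independent Gaussians with unit variance and mean 0 under N or D under SN (the “random noise” situation in which the Wald and Wolfowitz result applies), and the values D = 1, a = 19, b = 1/19, the number of runs, and all names are illustrative. Under independence the nth stage likelihood ratio is the product of the single-measurement ratios, so it is accumulated as a running sum of logarithms.

```python
import math
import random

D = 1.0  # assumed mean shift of each measurement under SN

def log_ratio(x):
    """Log likelihood ratio of one measurement: N(D, 1) against N(0, 1)."""
    return D * x - D * D / 2.0

def run_test(rng, a, b, signal_present, max_stages=10_000):
    """Run one sequential ratio test; return (decision, stage of termination)."""
    log_l = 0.0
    mean = D if signal_present else 0.0
    for stage in range(1, max_stages + 1):
        log_l += log_ratio(rng.gauss(mean, 1.0))  # log of l(X_n)
        if log_l >= math.log(a):
            return "SN", stage                    # X_n listed in A_n
        if log_l <= math.log(b):
            return "N", stage                     # X_n listed in B_n
    return "undecided", max_stages                # safety cap, not part of the theory

rng = random.Random(1)
runs = [run_test(rng, a=19.0, b=1.0 / 19.0, signal_present=False)
        for _ in range(5000)]
stages = [stage for _, stage in runs]

# Formula (36): average of the observed stages of termination.
e_n = sum(stages) / len(stages)

# Formula (38): 1 plus, for each stage i, the fraction of runs that were
# still continuing after stage i (an estimate of P_N(C_i)).
continuing = [sum(s > i for s in stages) / len(stages)
              for i in range(1, max(stages))]
e_n_alt = 1.0 + sum(continuing)

false_alarm_rate = sum(d == "SN" for d, _ in runs) / len(runs)
```

The two estimates of E_N agree exactly on any set of runs, since both count the same stages in a different order; the simulation says nothing about the dependent case discussed in the text.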

Consider a test of fixed length as described in Section 2, with error probabilities
F and M. Although the optimum sequential test with these same error probabilities
generally requires less time on the average, it has the disadvantage that it will sometimes
use much more time than the fixed length test requires. In a conversation with the
authors, Professor Mark Kac of Cornell University suggested that the dispersion, or
variance, of the sample numbers may be so large as seriously to affect the usefulness
of the sequential tests in applications to signal detectability. Certainly this matter
should be investigated before a final decision is reached concerning the merits of
sequential tests relative to tests on a fixed observation interval. However it is a difficult
matter to calculate the variance of the sample numbers. Therefore an electronic
simulator is being built at the University of Michigan which will simulate both types
of tests and will provide data for ROC curves of both types as well as the distribution
of the (sequential) sample numbers.


4. Optimum Detection for Specific Cases

4.1 Introduction

The chief conclusion obtained from the general theory of signal detectability
presented in Section 2 of this paper is that a receiver which calculates the likelihood
ratio for each receiver input is the optimum receiver for detecting signals in noise.
It is the purpose of Section 4 to consider a number of different ensembles of
signals with bandlimited white Gaussian noise. For each case, a possible receiver
design is discussed. The primary emphasis, however, is on obtaining the probability of
detection and probability of false alarm, and hence on estimates of optimum receiver
performance for the various cases.

The cases which are presented were chosen from the simplest problems in signal
detection which closely represent practical situations. They are listed in Table I along
with examples of engineering problems in which they find application. In the last two
cases the uncertainty in the signal can be varied, and some light is thrown on the
relationship between uncertainty and the ability to detect signals. The variety of
examples presented should serve to suggest methods for attacking other simple signal
detection problems and to give insight into problems too complicated to allow a direct
solution.

TABLE I

Section  Description of Signal Ensemble      Application
4.4      Signal known exactly*               Coherent radar with a target of known range and
                                             character
4.5      Signal known except for phase*      Ordinary pulse radar with no integration and with
                                             a target of known range and character
4.6      Signal a sample of white Gaussian   Detection of noise-like signals; detection of
         noise                               speech sounds in Gaussian noise
4.7      Detector output of a broad band     Detecting a pulse of known starting time (such as
         receiver                            a pulse from a radar beacon) with a crystal-video
                                             or other type broad band receiver
4.8      A radar case (a train of pulses     Ordinary pulse radar with integration and with a
         with incoherent phase)              target of known range and character
4.10     Signal one of M orthogonal          Coherent radar where the target is at one of a
         signals                             finite number of non-overlapping positions
4.11     Signal one of M orthogonal          Ordinary pulse radar with no integration and with
         signals known except for phase      a target which may appear at one of a finite
                                             number of non-overlapping positions

* Our treatment Of these two fundame
work, but here they are treated in terms of li
receivers as well
Solved for the more
spectrum [11,12]. Th
considerably more invo

The reader will find the discussion of likelihood ratio and it 
to follow if he keeps in mind the connection between a criterion 
likelihood ratio. In an optimum criterion type system, the operator will say that a sig- 
nal is present whenever the likelihood ratio is above a certain level B. He will say that 
Only noise is present when the likelihood ratio is below B. For each operating level 
B, there is a false alarm probability and a probability of detection. The false alarm 
probability is the probability that the likelihood ratio ICY) will be greater than B if no 
signal is sent; this is by definition the complementary distribution function Fy(B). 
Likewise, the complementary distribution Fs~(f) is the probability that ICY) will be 
greater than B if there is signal plus noise, and hence Fsyx(f) is the probability of 


detection if a signal is sent. 


It is the purpose of Section 


s distribution easier 
type receiver and 


4.2 Gaussian noise 

In the remainder of this paper the receiver inputs will be assumed to be defined
on a finite observation interval, $0 < t < T$. It will further be assumed that the receiver
inputs are series-bandlimited. By sampling plan C (Section 1.2) any such receiver
input $x(t)$ can be reconstructed from sample values of the function taken at points $1/2W$
apart throughout the observation interval, i.e.,

$$x(t) = \sum_{k=1}^{2WT} x_k \psi_k(t), \tag{44}$$

where

$$\psi_k(t) = \frac{\sin\left[2WT\,\pi\left(\dfrac{t}{T} - \dfrac{k}{2WT}\right)\right]}{2WT \sin\left[\pi\left(\dfrac{t}{T} - \dfrac{k}{2WT}\right)\right]} \quad\text{and}\quad x_k = x\!\left(\frac{k}{2W}\right). \tag{45}$$

Therefore the receiver input $x$ can be represented by the sample $(x_1, x_2, \ldots, x_{2WT})$.
In Section 4 the notation $x$ will be used to denote either the receiver input function $x(t)$
or the sample $(x_1, x_2, \ldots, x_{2WT})$. Similarly the signal $s(t)$, or simply $s$, can be
represented by the sample $(s_1, \ldots, s_{2WT})$, where $s_k = s(k/2W)$.
Only the probability distributions for receiver inputs $x(t)$ can be specified. The
distribution must be given for the receiver inputs both with noise alone and with signal
plus noise. The probability distributions are described by giving the probability
density functions $f_{SN}(x)$ and $f_N(x)$ for the receiver inputs $x$.

The probability density function for the receiver inputs with noise alone is
assumed to be

$$f_N(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi N}} \exp\left[-\frac{x_i^2}{2N}\right]$$

or

$$f_N(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right], \tag{46}$$

where $n$ is $2WT$ and $N$ is the noise power. It can be verified easily that this probability
density function is the description of noise which has a Gaussian distribution of
amplitude at every time, is stationary, and has the same average power in each of its
Fourier components. Thus we shall refer to it as "stationary bandlimited white
Gaussian noise."


The functions $\psi_i(t)$ are orthogonal and have energy $1/2W$, and therefore

$$\sum_{i=1}^{n} x_i^2 = 2W \int_0^T [x(t)]^2\,dt, \tag{47}$$

so that

$$f_N(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{N_0}\int_0^T [x(t)]^2\,dt\right], \tag{48}$$

where $N_0 = N/W$ is the noise power per unit bandwidth.

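In a practical application, information is given about the signals as they would
appear without noise at the receiver input, rather than about the signal plus noise
probability density. Then $f_{SN}(x)$ must be calculated from this information and the
probability density function $f_N(x)$ for the noise. The noise and the signals will be
assumed independent of each other.

If the input to the receiver is the sum of the signal and the noise, then a receiver
input $x(t)$ can arise from any signal $s(t)$ together with the noise $x(t) - s(t)$. The
probability density for $x$ is therefore the probability (density) that $s(t)$ and
$x(t) - s(t)$ will occur together, averaged over all possible $s(t)$.
If the probability of the signals is described by a density function $f_S(s)$, then

$$f_{SN}(x) = \int f_N(x - s) f_S(s)\,ds. \tag{49}$$

More generally,

$$f_{SN}(x) = \int f_N(x - s)\,dP_S(s), \tag{50}$$

where the integration is over all values of $s$ weighted by the probability $P_S$. If $f_N(x)$ is
taken from Eq. (46), this becomes

$$f_{SN}(x) = \int f_N(x - s)\,dP_S(s) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n}(x_i - s_i)^2\right] dP_S(s)$$

$$\phantom{f_{SN}(x)} = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right] \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} s_i^2\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right] dP_S(s), \tag{51}$$

or

$$f_{SN}(x) = \int f_N(x - s)\,dP_S(s) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\left[-\frac{1}{N_0}\int_0^T [x(t) - s(t)]^2\,dt\right] dP_S(s)$$

$$\phantom{f_{SN}(x)} = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{N_0}\int_0^T [x(t)]^2\,dt\right] \int \exp\left[-\frac{1}{N_0}\int_0^T [s(t)]^2\,dt\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\,dt\right] dP_S(s). \tag{52}$$

The factor $\exp\left[-(1/N_0)\int_0^T [x(t)]^2\,dt\right] = \exp\left[-(1/2N)\sum x_i^2\right]$ can be brought out of the
integral since it does not depend on $s$, the variable of integration. Note that the integral

$$\int_0^T [s(t)]^2\,dt = \frac{1}{2W}\sum_{i=1}^{n} s_i^2 = E(s) \tag{53}$$

is the energy* of the expected signal, while

$$\int_0^T x(t)s(t)\,dt = \frac{1}{2W}\sum_{i=1}^{n} x_i s_i \tag{54}$$

is the cross correlation between the expected signal and the receiver input.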
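As a minimal numerical sketch of Eqs. (53) and (54), assuming the sampled representation used in this section, the energy and cross correlation can be computed directly from the sample values. The helper names and the example values of W, N, and the samples are illustrative only and do not come from the paper.

```python
def signal_energy(s, W):
    """Eq. (53): E(s) = (1/2W) * sum of s_i**2, samples taken 1/(2W) apart."""
    return sum(si * si for si in s) / (2.0 * W)

def cross_correlation(x, s, W):
    """Eq. (54): (1/2W) * sum of x_i * s_i."""
    return sum(xi * si for xi, si in zip(x, s)) / (2.0 * W)

def noise_exponent(x, N):
    """Exponent of Eq. (46): -(1/2N) * sum of x_i**2."""
    return -sum(xi * xi for xi in x) / (2.0 * N)

# Hypothetical values: bandwidth W = 2, noise power N = 4, four samples.
W, N = 2.0, 4.0
N0 = N / W                       # noise power per unit bandwidth
s = [1.0, -1.0, 1.0, -1.0]
x = [0.5, -1.5, 2.0, 0.0]

# The same exponent written with N0 and the integral of x**2, as in Eq. (48).
exponent_48 = -signal_energy(x, W) / N0
```

The last line illustrates why Eqs. (46) and (48) agree: the sum of squared samples divided by 2N equals the integral of the squared input divided by N0.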


4.3 Likelihood ratio with Gaussian noise 
Likelihood ratio is defined as the ratio of the probability density functions
$f_{SN}(x)$ and $f_N(x)$. With white Gaussian noise it is obtained by dividing Eq. (51) and (52)
by (46) and (48) respectively:

$$l(x) = \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} s_i^2\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right] dP_S(s), \tag{55}$$

or

$$l(x) = \int \exp\left[-\frac{E(s)}{N_0}\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\,dt\right] dP_S(s). \tag{56}$$

If the signal is known exactly or completely specified, the probability for that
signal is unity, and the probability for any set of possible signals not containing $s$ is
zero. Then the likelihood ratio becomes

$$l_s(x) = \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} s_i^2\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \tag{57}$$

or

$$l_s(x) = \exp\left[-\frac{E(s)}{N_0}\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\,dt\right]. \tag{58}$$

Thus the general formulas (55) and (56) for likelihood ratio state that $l(x)$ is the weighted
average of $l_s(x)$ over the set of all signals, i.e.,

$$l(x) = \int l_s(x)\,dP_S(s). \tag{59}$$

* This assumes that the circuit impedance is normalized to one ohm.


An equipment which calculates the likelihood ratio $l(x)$ for each receiver input
$x$ is the optimum receiver. The form of equation (58) suggests one form which this
equipment might take. First, for each possible expected signal $s$, the individual
likelihood ratio $l_s(x)$ is calculated. Then these numbers are averaged. Since the set of
expected signals is often infinite, this direct method is usually impractical. It is
frequently possible in particular cases to obtain by mathematical operations on Eq. (58) a
different form for $l(x)$ which can be recognized as the response of a realizable electronic
equipment, simpler than the equipment specified by the direct method. It is essentially
this which is done in the following paragraphs.
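The direct method can be sketched for the simplest situation, assuming a finite signal ensemble with known prior weights. Eq. (57) is evaluated for each candidate signal and the results are averaged as in Eq. (59). The function names and the numerical values below are hypothetical and serve only as an illustration.

```python
import math

def ls_known(x, s, N):
    """Eq. (57): likelihood ratio for one exactly known signal s, noise power N."""
    energy_term = sum(si * si for si in s) / (2.0 * N)
    corr_term = sum(xi * si for xi, si in zip(x, s)) / N
    return math.exp(-energy_term + corr_term)

def likelihood_ratio(x, signals, weights, N):
    """Eq. (59) for a finite ensemble: weighted average of the individual ratios."""
    return sum(w * ls_known(x, s, N) for s, w in zip(signals, weights))

# Hypothetical example: two equally likely signals, four samples each.
x = [0.9, -1.1, 1.2, -0.8]
signals = [[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]
lr = likelihood_ratio(x, signals, [0.5, 0.5], N=1.0)
```

For an infinite ensemble the sum becomes an integral, which is why the sections that follow look for closed forms instead of this brute-force average.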


If the distribution function $P_S(s)$ depends on various parameters such as carrier
phase, signal energy, or carrier frequency, and if the distributions in these parameters
are independent, the expression for likelihood ratio can be simplified somewhat. If
these parameters are indicated by $r_1, r_2, \ldots, r_n$, and the associated probability density
functions are denoted by $f_1(r_1), f_2(r_2), \ldots, f_n(r_n)$, then

$$dP_S(s) = f_1(r_1) \cdots f_n(r_n)\,dr_1 \cdots dr_n.$$

The likelihood ratio becomes

$$l(x) = \int \cdots \int l_s(x)\, f_1(r_1) \cdots f_n(r_n)\,dr_1 \cdots dr_n = \int f_1(r_1) \left[\int f_2(r_2) \cdots \left[\int f_n(r_n)\, l_s(x)\,dr_n\right] \cdots dr_2\right] dr_1. \tag{60}$$

Thus the likelihood ratio can be found by averaging $l_s(x)$ with respect to the parameters.
4.4 The case of a signal known exactly 


The likelihood ratio for the case when the signal is known exactly has already
been presented in Section 4.3:

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \tag{61}$$

or

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\,dt\right]. \tag{62}$$

As the first step in finding the distribution functions for $l(x)$, it is convenient to
find the distribution for $(1/N)\sum x_i s_i$ when there is noise alone. Then the input
$x = (x_1, x_2, \ldots, x_n)$ is due to white Gaussian noise. It can be seen from Eq. (46) that each
$x_i$ has a normal distribution with zero mean and variance $N = WN_0$ and that the $x_i$
are independent. Because the $s_i$ are constants depending on the signal to be detected,
$s = (s_1, s_2, \ldots, s_n)$, each summand $(x_i s_i)/N$ has a normal distribution with mean
$s_i/N$ times the mean of $x_i$, and with variance $(s_i/N)^2$ times the variance of $x_i$, which are
zero and $s_i^2/N$ respectively. Because the $x_i$ are independent, the summands $(s_i x_i)/N$ are
independent, each with a normal distribution, and therefore their sum has a normal
distribution with mean the sum of the means (i.e., zero) and variance the sum of the
variances:

$$\sum_{i=1}^{n} \frac{s_i^2}{N} = \frac{2WE(s)}{N} = \frac{2E}{N_0} = 2\,\frac{\text{Signal Energy}}{\text{Noise Power Per Unit Bandwidth}}. \tag{63}$$
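The distribution for $(1/N)\sum x_i s_i$ with noise alone is thus normal with zero mean and
variance $2E/N_0$. Recalling from Eq. (61)

$$l(x) = \exp\left[-\frac{E}{N_0} + \frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \tag{64}$$

one sees that the distribution for $(1/N)\sum x_i s_i$ can be used directly by introducing $\alpha$
defined by

$$\beta = \exp\left[-\frac{E}{N_0} + \alpha\right], \quad\text{or}\quad \alpha = \frac{E}{N_0} + \ln\beta. \tag{65}$$

The inequality $l(x) \ge \beta$ is equivalent to $(1/N)\sum x_i s_i \ge \alpha$, and therefore

$$F_N(\beta) = \sqrt{\frac{N_0}{4\pi E}} \int_{\alpha}^{\infty} \exp\left[-\frac{N_0}{4E}\,y^2\right] dy. \tag{66}$$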


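The distribution for the case of signal plus noise can be found by using Eq. (19),
which states that

$$\frac{dP_{SN}[A(\beta)]}{dP_N[A(\beta)]} = \beta. \tag{67}$$

Because these probabilities are equal to the complementary distribution functions for
likelihood ratio, this can be written as

$$dF_{SN}(\beta) = \beta\,dF_N(\beta). \tag{68}$$

Differentiating Eq. (66),

$$dF_N(\beta) = -\sqrt{\frac{N_0}{4\pi E}} \exp\left[-\frac{N_0}{4E}\,\alpha^2\right] d\alpha, \tag{69}$$

and combining (65), (68), and (69), one obtains

$$dF_{SN}(\beta) = -\sqrt{\frac{N_0}{4\pi E}} \exp\left[-\frac{E}{N_0} + \alpha - \frac{N_0}{4E}\,\alpha^2\right] d\alpha = -\sqrt{\frac{N_0}{4\pi E}} \exp\left[-\frac{N_0}{4E}\left(\alpha - \frac{2E}{N_0}\right)^2\right] d\alpha. \tag{70}$$

Thus,

$$F_{SN}(\beta) = \sqrt{\frac{N_0}{4\pi E}} \int_{\alpha}^{\infty} \exp\left[-\frac{N_0}{4E}\left(y - \frac{2E}{N_0}\right)^2\right] dy. \tag{71}$$

In summary, $\alpha$, and therefore $\ln\beta$, have normal distributions with signal plus noise as
well as with noise alone; the variance of each distribution is $2E/N_0$, and the difference
of the means is $2E/N_0$.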
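A short sketch of how Eqs. (65), (66), and (71) give one point of the receiver operating characteristic follows, assuming d = 2E/N0 is supplied directly. The names `upper_tail` and `roc_point` are illustrative and are not from the paper.

```python
import math

def upper_tail(z):
    """Probability that a standard normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def roc_point(beta, d):
    """Return (F_N, F_SN) at operating level beta for a signal known exactly.

    d = 2E/N0 is both the common variance and the difference of the means.
    """
    alpha = d / 2.0 + math.log(beta)               # Eq. (65), with E/N0 = d/2
    f_n = upper_tail(alpha / math.sqrt(d))         # Eq. (66): mean 0, variance d
    f_sn = upper_tail((alpha - d) / math.sqrt(d))  # Eq. (71): mean d, variance d
    return f_n, f_sn

# Example: d = 4 at operating level beta = 1.
false_alarm, detection = roc_point(1.0, 4.0)
```

Sweeping beta from large to small values traces the whole curve for a fixed d; at beta = 1 the two error probabilities (false alarm and miss) are equal.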


The receiver operating characteristic curves in Figs. 2 and 3* are plotted for
any case in which $\ln l$ has a normal distribution with the same variance both with noise
alone and with signal plus noise. The parameter $d$ in this figure is equal to the square of
the difference of the means, divided by the variance. These receiver operating
characteristic curves apply to the case of the signal known exactly, with $d = 2E/N_0$.

* In Fig. 3, the receiver operating characteristic curves are plotted on "double-probability"
paper. On this paper both axes are linear in the error function

$$\operatorname{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left[-\frac{t^2}{2}\right] dt;$$

this makes the receiver operating characteristic straight lines.

FIGURE 2. Receiver operating characteristic, $F_{SN}(l)$ plotted against $F_N(l)$. $\ln l$ is a normal
deviate with $\sigma_{SN}^2 = \sigma_N^2$, $(M_{SN} - M_N)^2 = d\,\sigma_N^2$.

Eq. (62) describes what the ideal receiver should do for this case. The essential
operation in the receiver is obtaining the correlation, $\int_0^T s(t)x(t)\,dt$. The other
operations, multiplying by a constant, adding a constant, and taking the exponential
function, can be taken care of simply in the calibration of the receiver output. Electronic
means of obtaining cross correlation have been developed recently [13].

If the form of the signal is simple, there is a simple way to obtain this cross
correlation [6, 7]. Suppose $h(t)$ is the impulse response of a filter. The response $e_0(t)$
of the filter to a voltage $x(t)$ is

$$e_0(t) = \int_{-\infty}^{t} x(\tau)\,h(t - \tau)\,d\tau. \tag{72}$$

If a filter can be synthesized so that

$$h(t) = s(T - t), \quad 0 \le t \le T; \qquad h(t) = 0 \quad \text{otherwise}, \tag{73}$$

then

$$e_0(T) = \int_0^T x(\tau)s(\tau)\,d\tau, \tag{74}$$

so that the response of this filter at time $T$ is the cross correlation required. Thus, the
ideal receiver consists simply of a filter and amplifiers.
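The relation between Eqs. (72), (73), and (74) can be checked with a discrete-time sketch, assuming the integrals are replaced by sums over samples. The function names and sample values are illustrative only.

```python
def filter_output_at_T(x, s):
    """Discrete analogue of Eq. (72) evaluated at t = T.

    The impulse response is the time-reversed signal, h[k] = s[n-1-k],
    which is the sampled form of Eq. (73).
    """
    n = len(s)
    h = s[::-1]
    # Convolution sum at the last sample index: sum over j of x[j] * h[n-1-j].
    return sum(x[j] * h[n - 1 - j] for j in range(n))

def cross_correlation(x, s):
    """Discrete analogue of Eq. (74): sum of x_i * s_i."""
    return sum(xi * si for xi, si in zip(x, s))

# Hypothetical sample values.
x = [1.0, 2.0, -1.0, 0.5]
s = [0.5, -1.0, 2.0, 1.0]
matched = filter_output_at_T(x, s)
direct = cross_correlation(x, s)
```

Both quantities agree because reversing the signal in time and then convolving undoes the reversal at the sampling instant T.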


FIGURE 3. Receiver operating characteristic. $\ln l$ is a normal deviate,
$\sigma_{SN}^2 = \sigma_N^2$, $(M_{SN} - M_N)^2 = d\,\sigma_N^2$.

It should be noted that this filter is the same, except for a constant factor, as that
specified when one asks for the filter which maximizes peak signal to average noise-power
ratio [14].


4.5 Signal known except for carrier phase 


The signal ensemble considered in this section consists of all signals which differ 
from a given amplitude and frequency modulated signal only in their carrier phase, and 
all carrier phases are assumed equally likely. 


$$s(t) = f(t)\cos[\omega t + \phi(t) - \theta]. \tag{75}$$

Since the unknown phase angle $\theta$ has a uniform distribution,

$$dP_S(\theta) = \frac{1}{2\pi}\,d\theta. \tag{76}$$

The likelihood ratio can be found by applying Eq. (56), and since the signal energy
$E(s)$ is the same for all values of the carrier phase $\theta$,

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \int \exp\left[\frac{1}{N}\sum_i x_i s_i\right] dP_S(s). \tag{77}$$

Expanding $s$ into the coefficients of $\cos\theta$ and $\sin\theta$ will be helpful:

$$s(t) = f(t)\cos[\omega t + \phi(t)]\cos\theta + f(t)\sin[\omega t + \phi(t)]\sin\theta, \tag{78}$$

and

$$\frac{1}{N}\sum_i x_i s_i = \cos\theta\,\frac{1}{N}\sum_i x_i f(t_i)\cos[\omega t_i + \phi(t_i)] + \sin\theta\,\frac{1}{N}\sum_i x_i f(t_i)\sin[\omega t_i + \phi(t_i)].^{*} \tag{79}$$

Because we wish to integrate with respect to $\theta$ to find the likelihood ratio, it is
easiest to introduce parameters similar to polar coordinates $(r, \theta_0)$ such that

$$\frac{r}{N}\cos\theta_0 = \frac{1}{N}\sum_i x_i f(t_i)\cos[\omega t_i + \phi(t_i)], \qquad \frac{r}{N}\sin\theta_0 = \frac{1}{N}\sum_i x_i f(t_i)\sin[\omega t_i + \phi(t_i)], \tag{80}$$

and therefore

$$\frac{1}{N}\sum_i x_i s_i = \frac{r}{N}\cos(\theta - \theta_0). \tag{81}$$

Using this form the likelihood ratio becomes

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \int_0^{2\pi} \exp\left[\frac{r}{N}\cos(\theta - \theta_0)\right] \frac{d\theta}{2\pi} = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\frac{r}{N}\right), \tag{82}$$

where $I_0$ is the Bessel function of zero order and pure imaginary argument.

* $t_i$ denotes the $i$th sampling time, i.e., $t_i = i/2W$.

$I_0$ is a strictly monotone increasing function, and therefore the likelihood ratio
will be greater than a value $\beta$ if and only if $r/N$ is greater than some value corresponding
to $\beta$.
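The closed form of Eq. (82) can be compared with a brute-force average over the carrier phase, Eq. (77). The sketch below assumes the in-phase and quadrature reference samples f(t_i) cos[...] and f(t_i) sin[...] are given as the lists `c` and `q`; those names, the series routine `bessel_i0`, and the numerical values are illustrative and not part of the paper.

```python
import math

def bessel_i0(z):
    """Modified Bessel function I0 from its power series."""
    total, term, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= (z / 2.0) ** 2 / (k * k)
        total += term
    return total

def envelope_statistic(x, c, q):
    """r of Eq. (80): length of the (in-phase, quadrature) correlation vector."""
    xc = sum(xi * ci for xi, ci in zip(x, c))
    xq = sum(xi * qi for xi, qi in zip(x, q))
    return math.hypot(xc, xq)

def lr_closed_form(x, c, q, N, e_over_n0):
    """Eq. (82): exp(-E/N0) * I0(r/N)."""
    return math.exp(-e_over_n0) * bessel_i0(envelope_statistic(x, c, q) / N)

def lr_phase_average(x, c, q, N, e_over_n0, steps=2000):
    """Eq. (77) by direct averaging over a uniform grid of phases theta."""
    total = 0.0
    for k in range(steps):
        th = 2.0 * math.pi * (k + 0.5) / steps
        corr = sum(xi * (ci * math.cos(th) + qi * math.sin(th))
                   for xi, ci, qi in zip(x, c, q))
        total += math.exp(corr / N)
    return math.exp(-e_over_n0) * total / steps

# Hypothetical samples.
c = [1.0, 0.0, -1.0, 0.0]
q = [0.0, 1.0, 0.0, -1.0]
x = [0.5, 1.0, -0.3, 0.2]
closed = lr_closed_form(x, c, q, N=1.0, e_over_n0=1.0)
averaged = lr_phase_average(x, c, q, N=1.0, e_over_n0=1.0)
```

The closed form needs only the single number r, which is the property the receiver design later in this section exploits.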
In the previous section it was shown that the sum $(1/N)\sum x_i s_i$ has a normal
distribution with zero mean and variance $2E/N_0$ if the receiver input $x(t)$ is due to noise
alone; $E$ is the energy of the signal known exactly, $s(t)$, and $N_0$ is the noise power per
cycle. Since $f(t)\cos[\omega t + \phi(t)]$ and $f(t)\sin[\omega t + \phi(t)]$ are signals known exactly,
both $(r/N)\cos\theta_0$ and $(r/N)\sin\theta_0$ have normal distributions with zero mean and
variance $2E/N_0$. The probability that due to noise alone

$$\frac{r}{N} = \sqrt{\left(\frac{r}{N}\cos\theta_0\right)^2 + \left(\frac{r}{N}\sin\theta_0\right)^2}$$

will exceed any fixed value is given by the well known chi-square distribution for two
degrees of freedom, $K_2(\alpha^2)$. The proper normalization yielding zero mean and unit
variance requires that the variable be

$$\frac{r}{N}\sqrt{\frac{N_0}{2E}};$$

that is,*

$$P_N\left(\frac{r}{N}\sqrt{\frac{N_0}{2E}} \ge \alpha\right) = K_2(\alpha^2) = \exp\left[-\frac{\alpha^2}{2}\right]. \tag{83}$$

If $\alpha$ is defined by the equation

$$\beta = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right), \tag{84}$$

the distribution for $l(x)$ in the presence of noise alone is in the simple form

$$F_N(\beta) = \exp\left[-\frac{\alpha^2}{2}\right]. \tag{85}$$

It follows from (85) that

$$dF_N(\beta) = -\alpha \exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \tag{86}$$

If in equation (68), namely

$$\beta\,dF_N(\beta) = dF_{SN}(\beta), \tag{87}$$

$\beta$ is replaced by the expression given in (84) and $dF_N(\beta)$ is replaced by that given in
(86), then

$$dF_{SN}(\beta) = -\exp\left[-\frac{E}{N_0}\right] \alpha \exp\left[-\frac{\alpha^2}{2}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right) d\alpha \tag{88}$$

is obtained. Integration of (88) yields

$$F_{SN}(\beta) = \exp\left[-\frac{E}{N_0}\right] \int_{\alpha}^{\infty} \alpha \exp\left[-\frac{\alpha^2}{2}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right) d\alpha. \tag{89}$$

* The symbol $P_N(x \ge \alpha)$ denotes the probability that the variable $x$ is not less than the
constant $\alpha$.

FIGURE 4. Receiver operating characteristic. Signal known except for phase.

Eqs. (85) and (89) yield the receiver operating characteristic in parametric form, and
Eq. (84) gives the associated operating levels [15]. These are graphed in Fig. 4 for some
of the same values of signal energy to noise power per unit bandwidth as were used
when the phase angle was known exactly, Figs. 2 and 3, so that the effect of knowing
the phase can be easily seen.

If the signal is sufficiently simple so that a filter could be synthesized to match the
expected signal for a given carrier phase $\theta$ as in the case of a signal known exactly,
then there is a simple way to design a receiver to obtain likelihood ratio. For simplicity
let us consider only amplitude modulated signals [$\phi(t) = 0$] in Eq. (75). Let us also
choose $\theta = 0$. (Any phase could have been chosen.) Then the filter has impulse
response

$$h(t) = f(T - t)\cos[\omega(T - t)], \quad 0 \le t \le T; \qquad h(t) = 0 \quad \text{otherwise}. \tag{90}$$


W. W. PETERSON, T. G. BIRDSALL, AND W. C. FOX 193 


The output of the filter in response to $x(t)$ is then

$$e_0(t) = \int_{t-T}^{t} x(\tau)\,h(t - \tau)\,d\tau = \int_{t-T}^{t} x(\tau)\,f(\tau + T - t)\cos\omega(\tau + T - t)\,d\tau$$

$$\phantom{e_0(t)} = \cos\omega(T - t) \int_{t-T}^{t} x(\tau)\,f(\tau + T - t)\cos\omega\tau\,d\tau - \sin\omega(T - t) \int_{t-T}^{t} x(\tau)\,f(\tau + T - t)\sin\omega\tau\,d\tau. \tag{91}$$

The envelope of the filter output will be the square root of the sum of the squares
of the integrals,* and the envelope at time $T$ will be proportional to $r/N$, since

$$\left(\frac{N_0}{2}\,\frac{r}{N}\right)^2 = \left[\int_0^T x(\tau)f(\tau)\cos\omega\tau\,d\tau\right]^2 + \left[\int_0^T x(\tau)f(\tau)\sin\omega\tau\,d\tau\right]^2, \tag{92}$$

which can be identified as the square of the envelope of $e_0(t)$ at time $T$. If the input
$x(t)$ passes through the filter with an impulse response given by Eq. (90), then through a
linear detector, the output will be $(N_0/2)\,r/N$ at time $T$. Because the likelihood ratio,
Eq. (82), is a known monotone function of $r/N$, the output can be calibrated to read
the likelihood ratio of the input.
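The parametric form of the receiver operating characteristic for a signal known except for phase, Eqs. (85) and (89), can be evaluated numerically. The sketch below is an illustration under stated assumptions: d stands for 2E/N0, the integral in Eq. (89) is approximated by a midpoint rule truncated at a finite upper limit, and the function names are hypothetical.

```python
import math

def bessel_i0(z):
    """Modified Bessel function I0 from its power series."""
    total, term, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= (z / 2.0) ** 2 / (k * k)
        total += term
    return total

def roc_unknown_phase(alpha, d, steps=4000):
    """Return (F_N, F_SN) for parameter alpha, with d = 2E/N0.

    F_N is Eq. (85). F_SN is Eq. (89), integrated by the midpoint rule
    from alpha up to alpha + sqrt(d) + 10, beyond which the integrand
    is negligible for moderate d.
    """
    f_n = math.exp(-alpha * alpha / 2.0)
    a = math.sqrt(d)
    upper = alpha + a + 10.0
    h = (upper - alpha) / steps
    total = 0.0
    for k in range(steps):
        y = alpha + (k + 0.5) * h
        total += y * math.exp(-(y * y + d) / 2.0) * bessel_i0(a * y)
    return f_n, total * h

# Example: d = 4, alpha = 2.
false_alarm, detection = roc_unknown_phase(2.0, 4.0)
```

Comparing the result with the known-phase case at the same d and the same false alarm probability shows the loss caused by not knowing the carrier phase.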


4.6 Signal consisting of a sample of white Gaussian noise 


Suppose the values of the signal voltage at the sample points are independent
Gaussian random variables with zero mean and variance $S$, the signal power. The
probability density due to signal plus noise is also Gaussian, since signal plus noise is
the sum of two Gaussian random variables:

$$f_{SN}(x) = \left(\frac{1}{2\pi(N + S)}\right)^{n/2} \exp\left[-\frac{1}{2(N + S)}\sum_{i=1}^{n} x_i^2\right], \tag{93}$$

where $n = 2WT$.
The likelihood ratio is

$$l(x) = \left(\frac{N}{N + S}\right)^{n/2} \exp\left[\frac{S}{N + S}\,\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right]. \tag{94}$$
In determining the distribution functions for $l$, it is convenient to introduce the
parameter $\alpha$, defined by the equation

$$\beta = \left(\frac{N}{N + S}\right)^{n/2} \exp\left[\frac{S}{N + S}\,\frac{\alpha^2}{2}\right]. \tag{95}$$

Then the condition $l(x) \ge \beta$ is equivalent to the condition that $(1/N)\sum x_i^2 \ge \alpha^2$. In
the presence of noise alone the random variables $x_i/\sqrt{N}$ have zero mean and unit
variance, and they are independent. Therefore, the probability that the sum of the
squares of these variables will exceed $\alpha^2$ is the chi-square distribution with $n$ degrees
of freedom, i.e.,

$$F_N(\beta) = K_n(\alpha^2). \tag{96}$$

* If the line spectrum of $s(t)$ is zero at zero frequency and at all frequencies equal to or
greater than $2\omega/2\pi$, then it can be shown that these integrals contain no frequencies as high as
$\omega/2\pi$.

Similarly, in the presence of signal plus noise the random variables $x_i/\sqrt{N + S}$ have
zero mean and unit variance. The condition $(1/N)\sum x_i^2 \ge \alpha^2$ is the same as requiring
that $[1/(N + S)]\sum x_i^2 \ge [N/(N + S)]\alpha^2$, and again making use of the chi-square
distribution,

$$F_{SN}(\beta) = K_n\!\left(\frac{N}{N + S}\,\alpha^2\right). \tag{97}$$

For large values of $n$, the chi-square distribution is approximately normal over
the center portion; more precisely [16], for $\alpha \ge 0$,

$$F_N(\beta) = K_n(\alpha^2) \approx \frac{1}{\sqrt{2\pi}} \int_{\sqrt{2}\,\alpha - \sqrt{2n - 1}}^{\infty} \exp\left[-\frac{y^2}{2}\right] dy, \tag{98}$$

and

$$F_{SN}(\beta) = K_n\!\left(\frac{N}{N + S}\,\alpha^2\right) \approx \frac{1}{\sqrt{2\pi}} \int_{\sqrt{2N/(N + S)}\,\alpha - \sqrt{2n - 1}}^{\infty} \exp\left[-\frac{y^2}{2}\right] dy. \tag{99}$$

If the signal energy is small compared to that of the noise, $\sqrt{N/(N + S)}$ is nearly unity
and both distributions have nearly the same variance. Then Figs. 2 and 3 apply to
this case too, with the value of $d$ given by

$$d = (2n - 1)\left(1 - \sqrt{\frac{N}{N + S}}\right)^2. \tag{100}$$

For these small signal to noise ratios and large samples, there is a simple
relation between signal to noise ratio, the number of samples, and the detection
index $d$:

$$d \approx \frac{n}{2}\left(\frac{S}{N}\right)^2 \quad \text{for } \frac{S}{N} \ll 1 \text{ and } n \gg 1. \tag{101}$$

Two signal to noise ratios, $(S/N)_1$ and $(S/N)_2$, will give approximately the same
operating characteristic if the corresponding numbers of sample points, $n_1$ and $n_2$, satisfy

$$\frac{n_1}{n_2} = \left[\frac{(S/N)_2}{(S/N)_1}\right]^2. \tag{102}$$
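The trade between sample count and signal to noise ratio can be sketched numerically. The functions below assume the detection index takes the form d = (2n - 1)(1 - sqrt(N/(N+S)))^2, which follows from applying the normal approximation of Eqs. (98) and (99) with nearly equal variances, and the small-signal form d = (n/2)(S/N)^2. The function names and example numbers are illustrative only.

```python
import math

def detection_index(n, snr):
    """Detection index from the normal approximation, with snr = S/N."""
    return (2 * n - 1) * (1.0 - math.sqrt(1.0 / (1.0 + snr))) ** 2

def detection_index_small_snr(n, snr):
    """Small-signal, large-sample approximation: d = (n/2) * (S/N)**2."""
    return 0.5 * n * snr ** 2

def samples_for_same_roc(n1, snr1, snr2):
    """Sample count n2 giving about the same operating characteristic at snr2."""
    return n1 * (snr1 / snr2) ** 2

# Hypothetical example: 10,000 samples at S/N = 0.01.
d_full = detection_index(10000, 0.01)
d_small = detection_index_small_snr(10000, 0.01)
n2 = samples_for_same_roc(100, 0.2, 0.1)
```

Halving the signal to noise ratio therefore calls for about four times as many samples to hold the operating characteristic fixed.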


By Eq. (94), the likelihood ratio is a monotone function of $\sum x_i^2$. But the output of an
energy detector,

$$e_0(T) = \int_0^T [x(t)]^2\,dt = \frac{1}{2W}\sum_{i=1}^{n} x_i^2, \tag{103}$$

is proportional to $\sum x_i^2$. Therefore an energy detector can be calibrated to read
likelihood ratio, and hence can be used as an optimum receiver in this case.


4.7 Video design of a broad band receiver

The problem considered in this section is represented schematically in Fig. 5. 
The signals and noise are assumed to have passed through a band pass filter, and at the 
output of the filter, point A on the diagram, they are assumed to be limited in spectrum 
to a band of width $W$ and center frequency $\omega/2\pi > W/2$. The noise is assumed to be
Gaussian noise with a uniform spectrum over the band. The signals and noise then
pass through a linear detector. The output of the detector is the envelope of the signals 
and noise as they appeared at point A; all knowledge of the phase of the receiver input 
is lost at point B. The signals and noise as they appear at point B are considered re- 
ceiver inputs, and the theory of signal detectability is applied to these video inputs to as- 
certain the best video design and the performance of such a system. The mathematical 
description of the signals and noise will be given for the signals and noise as they appear 
at point A. The envelope functions, which appear at point B, will be derived, and the 
likelihood ratio and its distribution will be found for these envelope functions. 

The only case which will be considered here is the case in which the amplitude of
the signal as it would appear at point A is a known function of time.

Any function at point A will be band limited to a band of width $W$ and center
frequency $\omega/2\pi > W/2$. Any such function $f(t)$ can be expanded as follows:

$$f(t) = x(t)\cos\omega t + y(t)\sin\omega t, \tag{105}$$

where $x(t)$ and $y(t)$ are band limited to frequencies no higher than $W/2$, and hence can
themselves* be expanded by sampling plan C, yielding

$$f(t) = \sum_i \left[x\!\left(\frac{i}{W}\right)\psi_i(t)\cos\omega t + y\!\left(\frac{i}{W}\right)\psi_i(t)\sin\omega t\right]. \tag{106}$$

The amplitude of the function $f(t)$ is

$$r(t) = \sqrt{[x(t)]^2 + [y(t)]^2}, \tag{107}$$

and thus the amplitude at the $i$th sampling point is

$$r\!\left(\frac{i}{W}\right) = r_i = \sqrt{x_i^2 + y_i^2}. \tag{108}$$

The angle

$$\theta_i = \arctan\frac{y_i}{x_i} = \arccos\frac{x_i}{r_i} \tag{109}$$

might be considered the phase of $f(t)$ at the $i$th sampling point. The function $f(t)$ then
might be described by giving the $r_i$ and $\theta_i$ rather than the $x_i$ and $y_i$.

* Because any function $f(t)$ at A has no frequency greater than $(\omega/2\pi) + (W/2)$, the
usual sampling plan C might have been used on $f(t)$. However, the distribution in noise alone,
$f_N(x)$, would probably not be applicable.

FIGURE 5. Block diagram of a broad band receiver: input from antenna or mixer, band pass
filter (point A), linear detector, video amplifier.

Let us denote by $x_i$, $y_i$, or $r_i$, $\theta_i$, the sample values for a receiver input after the
filter (i.e., at the point A in Fig. 5). Let $a_i$, $b_i$, or $f_i$, $\phi_i$, denote the sample values for
the signal as it would appear at point A if there were no noise. The envelope of the
signal, hence the amplitude sample values $f_i$, are assumed known. Let us denote by
$F_S(\phi_1, \phi_2, \ldots, \phi_{n/2})$ the distribution function of the phase sample values $\phi_i$. The


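probability density function for the input at A when there is white Gaussian noise and
no signal, with $n = 2WT$, is

$$f_N(x, y) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{2N}\sum_{i=1}^{n/2}\left(x_i^2 + y_i^2\right)\right], \tag{110}$$

and for signal plus noise, it is

$$f_{SN}(x, y) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n/2}\left((x_i - a_i)^2 + (y_i - b_i)^2\right)\right] dP_S(a, b). \tag{111}$$

Expressed in terms of the $(r, \theta)$ sample values, Eq. (110) and Eq. (111) become

$$f_N(r, \theta) = \left(\frac{1}{2\pi N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \,\exp\left[-\frac{1}{2N}\sum_{i=1}^{n/2} r_i^2\right], \tag{112}$$

and

$$f_{SN}(r, \theta) = \left(\frac{1}{2\pi N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n/2}\left(r_i^2 + f_i^2 - 2 r_i f_i \cos(\theta_i - \phi_i)\right)\right] dF_S(\phi_1, \ldots, \phi_{n/2}). \tag{113}$$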


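The factors $\prod r_i$ are introduced because they are the Jacobian of the transformation
from the $x, y$ sampling plan to the $r, \theta$ sampling plan [16].*
The probability density function for $r$ alone, i.e., the density function for the
output of the detector, is obtained simply by integrating the density functions for $r$
and $\theta$ with respect to $\theta$:

$$f_N(r) = \int_0^{2\pi} \cdots \int_0^{2\pi} f_N(r, \theta)\,d\theta_1\,d\theta_2 \cdots d\theta_{n/2},$$

or

$$f_N(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \,\exp\left[-\frac{1}{2N}\sum_{i=1}^{n/2} r_i^2\right], \tag{114}$$

and

$$f_{SN}(r) = \int_0^{2\pi} \cdots \int_0^{2\pi} f_{SN}(r, \theta)\,d\theta_1\,d\theta_2 \cdots d\theta_{n/2},$$

$$f_{SN}(r) = \left(\frac{1}{N}\right)^{n/2} \int \prod_{i=1}^{n/2} r_i \exp\left[-\frac{1}{2N}\left(r_i^2 + f_i^2\right)\right] I_0\!\left(\frac{r_i f_i}{N}\right) dF_S(\phi_1, \phi_2, \ldots, \phi_{n/2}),$$

or

$$f_{SN}(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \exp\left[-\frac{1}{2N}\left(r_i^2 + f_i^2\right)\right] I_0\!\left(\frac{r_i f_i}{N}\right). \tag{115}$$

* For example, in two dimensions, $f(x, y)\,dx\,dy = f(r, \theta)\,r\,dr\,d\theta$.

Notice that the probability density for $r$ is completely independent of the
distribution which the $\phi_i$ had; all information about the phase of the signals has been
lost.
The likelihood ratio for a video input, $r(t)$, is

$$l(r) = \frac{f_{SN}(r)}{f_N(r)} = \exp\left[-\frac{1}{2N}\sum_{i=1}^{n/2} f_i^2\right] \prod_{i=1}^{n/2} I_0\!\left(\frac{r_i f_i}{N}\right). \tag{116}$$

Again it is more convenient to work with the logarithm of the likelihood ratio. Thus,

$$\frac{1}{2N}\sum_{i=1}^{n/2} f_i^2 = \frac{W}{2N}\int_0^T [f(t)]^2\,dt = \frac{E}{N_0}, \tag{117}$$

and

$$\ln l(r) = -\frac{E}{N_0} + \sum_{i=1}^{n/2} \ln I_0\!\left(\frac{r_i f_i}{N}\right), \tag{118}$$

which is approximately

$$\ln l[r(t)] = -\frac{E}{N_0} + W\int_0^T \ln I_0\!\left[\frac{r(t)f(t)}{N}\right] dt. \tag{119}$$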


The function $\ln I_0(x)$ is approximately the parabola $x^2/4$ for small values of $x$
and is nearly linear for large values of $x$. Thus, the expression for likelihood ratio might
be approximated by

$$\ln l[r(t)] = -\frac{E}{N_0} + \frac{W}{4N^2}\int_0^T [r(t)]^2 [f(t)]^2\,dt \tag{120}$$

for small signals, and by

$$\ln l[r(t)] = C_1 + C_2 \int_0^T r(t)f(t)\,dt \tag{121}$$

for large signals, where $C_1$ and $C_2$ are chosen to approximate $\ln I_0$ best in the desired
range.
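The two regimes of ln I0 can be checked numerically. The following sketch, assuming a plain power-series evaluation of I0 (the helper names are illustrative), compares ln I0(x) with x^2/4 for small x and estimates its slope for large x.

```python
import math

def bessel_i0(z):
    """Modified Bessel function I0 from its power series."""
    total, term, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= (z / 2.0) ** 2 / (k * k)
        total += term
    return total

def log_i0(x):
    """Natural logarithm of I0(x)."""
    return math.log(bessel_i0(x))

def small_signal_approx(x):
    """Parabolic approximation x**2 / 4, useful when x is small."""
    return x * x / 4.0

# Small argument: the parabola is close to ln I0.
small_error = abs(log_i0(0.1) - small_signal_approx(0.1))

# Large argument: ln I0 grows almost linearly, with slope slightly below one.
slope_large = log_i0(21.0) - log_i0(20.0)
```

The nearly unit slope at large arguments is what justifies a linear detector for strong signals, while the parabola corresponds to a square law detector for weak ones.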
The integrals in Eqs. (120) and (121) can be interpreted as cross correlations. 


Thus the optimum receiver for weak signals is a square law detector, followed by a
correlator which finds the cross correlation between the detector output and $[f(t)]^2$,
the square of the envelope of the expected signal. For the case of large signal to noise
ratio, the optimum receiver is a linear detector, followed by a correlator which has for
its output the cross correlation of the detector output and $f(t)$, the amplitude of the
expected signal.


The distribution function for $l(r)$ cannot be found easily in this case. The
approximation developed here will apply to the receiver designed for low signal to noise
ratio, since this is the case of most interest in detection studies. An analogous
approximation for the large signal to noise ratios would be even easier to derive.

First we shall find the mean and standard deviation for the distribution of the
logarithm of the likelihood ratio as shown above,

$$\ln l(r) = -\frac{E}{N_0} + \frac{1}{4N^2}\sum_{i=1}^{n/2} r_i^2 f_i^2, \tag{122}$$

for the case of small signal to noise ratio. The probability density functions for each
$r_i$ are

$$g_{SN}(r_i) = \frac{r_i}{N} \exp\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right),$$

and

$$g_N(r_i) = \frac{r_i}{N} \exp\left[-\frac{r_i^2}{2N}\right]. \tag{123}$$

The notation $g_N(r_i)$ and $g_{SN}(r_i)$ is used to distinguish these from the joint distributions
of all the $r_i$ which were previously called $f_N(r)$ and $f_{SN}(r)$. The mean of each term
$r_i^2 f_i^2/4N^2$ in the sum in Eq. (122) is


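$$\mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{4N^2}\int_0^{\infty} r_i^2\,g_{SN}(r_i)\,dr_i,$$

or

$$\mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{4N^2}\int_0^{\infty} \frac{r_i^3}{N} \exp\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) dr_i.$$

Similarly,

$$\mu_N\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{4N^2}\int_0^{\infty} r_i^2\,g_N(r_i)\,dr_i = \frac{f_i^2}{4N^2}\int_0^{\infty} \frac{r_i^3}{N} \exp\left[-\frac{r_i^2}{2N}\right] dr_i. \tag{124}$$

The second moment of each term $r_i^2 f_i^2/4N^2$ is

$$\mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{16N^4}\int_0^{\infty} r_i^4\,g_{SN}(r_i)\,dr_i,$$

or

$$\mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{16N^4}\int_0^{\infty} \frac{r_i^5}{N} \exp\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) dr_i. \tag{125}$$

Similarly,

$$\mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{16N^4}\int_0^{\infty} r_i^4\,g_N(r_i)\,dr_i,$$

or

$$\mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{16N^4}\int_0^{\infty} \frac{r_i^5}{N} \exp\left[-\frac{r_i^2}{2N}\right] dr_i.$$

The integrals for the case of noise alone can be evaluated easily:

$$\mu_N\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{2N},$$

and

$$\mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{2N^2}. \tag{126}$$

The integrals for the case of signal plus noise can also be evaluated; in these cases the
result reduces to a polynomial. The required formulas are collected in convenient form in
Threshold Signals [5] on page 174. The results are

$$\mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{2N}\left(1 + \frac{f_i^2}{2N}\right), \qquad \mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{2N^2}\left(1 + \frac{f_i^2}{N} + \frac{f_i^4}{8N^2}\right). \tag{127}$$

Since

$$\sigma^2(Z) = \mu(Z^2) - [\mu(Z)]^2, \tag{128}$$

the variances of $r_i^2 f_i^2/4N^2$ are

$$\sigma_{SN}^2\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^4}{4N^2}\left(1 + \frac{f_i^2}{N}\right),$$

and

$$\sigma_N^2\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^4}{4N^2}. \tag{129}$$

For the sum of independent random variables, the mean is the sum of the means
of the terms and the variance is the sum of the variances. Therefore the means of $\ln l(r)$
are

$$\mu_{SN}[\ln l(r)] = -\frac{E}{N_0} + \sum_{i=1}^{n/2} \frac{f_i^2}{2N}\left(1 + \frac{f_i^2}{2N}\right),$$

and

$$\mu_N[\ln l(r)] = -\frac{E}{N_0} + \sum_{i=1}^{n/2} \frac{f_i^2}{2N}, \tag{130}$$

and the variances of $\ln l(r)$ are

$$\sigma_{SN}^2[\ln l(r)] = \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2}\left(1 + \frac{f_i^2}{N}\right),$$

and

$$\sigma_N^2[\ln l(r)] = \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2}. \tag{131}$$

If the distribution functions of $\ln l(r)$ can be assumed to be normal, they can be
obtained immediately from the mean and standard deviation of the logarithm of
likelihood ratio.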

Let us consider the case in which the incoming signal is a rectangular pulse which 
is M/W seconds long. * The energy of the pulse is half its duration times the amplitude 
squared of its envelope, for a normalized circuit impedance of one ohm. 

e distribution for the sum of M independent random vari- 
y function f(z) = z exp [-@Me: + a*)]l(azx) arises in the 
unpublished report by J. I. Marcum, A Statistical Theory of Target Detection by Pulsed Radar: 
Mathematical Appendix, Project Rand Report R-113. Marcum gives an exact expression for 
this distribution which is useful only for small values of M, and an approximation in Gram- 

than the normal approximation given here. Marcum's 


Charlier series which is more accurate 
expressions could be used in this case, and in the case presented in Section 4.6. 


* The problem of finding th 
ables, each with a probability densit 
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Thus of the WT numbers (fi), there are M consecutive ones which are not zero. These 
are given by 


(132) 


Where E is the pulse energy at point A in Fig. 5 in the absence of noise. For this case, 
Eq. (130) and Eq. (131) become 


1 E:? 
tsslIn I(r)] = HN , 
Asin I(F)] = 0, is 


2 


of yn / eS (: + 
Ss ol MN; \ M ye) | 
and 


9 


clin I0)] = bel 
A MN: 


75h 
0 

The distribution of $\ln l(r)$ is approximately normal if M is much larger than one,
for, by the central limit theorem, the distribution of a sum of M independent random
variables with a common distribution must approach the normal distribution as M
becomes large. The actual distribution for the case of noise alone can be calculated in
this case, since the convolution integral for the $g_N(r_i)$ with itself any number of times
can be expressed in closed form. The distribution of $\ln l(r)$ for signal plus noise is more
nearly normal than its distribution with noise alone, since the distributions $g_{SN}(r_i)$
are more nearly normal than $g_N(r_i)$.

The receiver operating characteristic for the case M = 16 is plotted in Fig. 6,
using the normal distribution as approximation to the true distribution. In many cases
it will be found that

$$
\frac{2E}{M N_0} \ll 1. \tag{134}
$$

In such a case the distributions have approximately the same variance. Assuming
normal distribution then leads to the curves of Figs. 2 and 3, with

$$
d = \sigma_N^2[\ln l(r)] = \frac{1}{M}\,\frac{E^2}{N_0^2}. \tag{135}
$$
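
The normal approximation of this section can be turned into numbers with a few lines of code. The sketch below is illustrative only: it assumes the moments of ln l have the form $\mu_N = 0$, $\sigma_N^2 = \mu_{SN} = E^2/(M N_0^2)$ and $\sigma_{SN}^2 = \sigma_N^2 (1 + 2E/(M N_0))$, and the function names `log_lr_moments` and `detection_probability` are hypothetical, not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def log_lr_moments(energy_ratio, m):
    """Return (mu_N, var_N, mu_SN, var_SN) of ln l for E/N0 = energy_ratio and M = m.

    Assumed forms (see lead-in): var_N = mu_SN = E^2 / (M N0^2) and
    var_SN = var_N * (1 + 2E / (M N0)).
    """
    base = energy_ratio ** 2 / m
    var_sn = base * (1.0 + 2.0 * energy_ratio / m)
    return 0.0, base, base, var_sn


def detection_probability(energy_ratio, m, p_false_alarm):
    """Detection probability at a fixed false alarm probability (normal approximation)."""
    mu_n, var_n, mu_sn, var_sn = log_lr_moments(energy_ratio, m)
    noise = NormalDist(mu_n, sqrt(var_n))
    signal = NormalDist(mu_sn, sqrt(var_sn))
    threshold = noise.inv_cdf(1.0 - p_false_alarm)  # threshold applied to ln l
    return 1.0 - signal.cdf(threshold)


if __name__ == "__main__":
    # One receiver operating characteristic for M = 16 and E/N0 = 8.
    for p_fa in (0.01, 0.1, 0.5):
        print(p_fa, round(detection_probability(8.0, 16, p_fa), 4))
```

Sweeping `p_false_alarm` from 0 to 1 for fixed E/N0 and M traces one receiver operating characteristic under these assumptions; when 2E/(M N0) is small the two variances nearly coincide and the curve depends on d alone.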


FIGURE 6. Receiver operating characteristic. Broad band receiver with optimum video design,
M = 16. (Axes: 100 F_SN(l) against 100 F_N(l).)


4.8 A radar case


This section deals with detecting a radar target at a given range. That is, we shall
assume that the signal, if it occurs, consists of a train of M pulses whose time of occurrence
and envelope shape are known. The carrier phase will be assumed to have
a uniform distribution for each pulse independent of all others, i.e., the pulses are
incoherent.

The set of signals can be described as follows:

$$
s(t) = \sum_{m=0}^{M-1} f(t + m\tau)\cos(\omega t + \theta_m), \tag{136}
$$

where the M angles $\theta_m$ have independent uniform distributions, and the function f,
which is the envelope of a single pulse, has the property that

$$
\int_0^T f(t + i\tau)\, f(t + j\tau)\, dt = \frac{2E}{M}\,\delta_{ij}, \tag{137}
$$

where $\delta_{ij}$ is the Kronecker delta function, which is zero if $i \neq j$ and unity if $i = j$.
The time $\tau$ is the interval between pulses. Eq. (137) states that the pulses are spaced far
enough so that they are orthogonal, and that the total signal energy is E.* The function
f(t) is also assumed to have no frequency components as high as $\omega/2\pi$.

* The factor 2 appears in (137) because f(t) is the pulse envelope; the factor M appears
because the total energy E is M times the energy of a single pulse.
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The likelihood ratio can be obtained by applying Eq. (56). Then 


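$$
l(r) = \exp\left[-\frac{E}{N_0}\right] \int_R \exp\left[\frac{2}{N_0}\int_0^T s(t)\, r(t)\, dt\right] dP_S(s) \tag{138}
$$

$$
= \exp\left[-\frac{E}{N_0}\right] \int_0^{2\pi}\!\cdots\!\int_0^{2\pi}
\exp\left[\frac{2}{N_0}\sum_{m=0}^{M-1}\int_0^T f(t + m\tau)\, r(t)\cos(\omega t + \theta_m)\, dt\right]
\frac{d\theta_0}{2\pi}\cdots\frac{d\theta_{M-1}}{2\pi}. \tag{139}
$$

The integral can be evaluated, as in Section 4.5, yielding

$$
l(r) = \exp\left[-\frac{E}{N_0}\right] \prod_{m=0}^{M-1} I_0\!\left(\frac{r_m}{N}\right), \tag{140}
$$

where

$$
\left(\frac{r_m}{N}\right)^2 =
\left[\frac{2}{N_0}\int_0^T f(t + m\tau)\, r(t)\cos\omega t\, dt\right]^2 +
\left[\frac{2}{N_0}\int_0^T f(t + m\tau)\, r(t)\sin\omega t\, dt\right]^2. \tag{141}
$$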

This quantity $r_m$ is almost identical with the quantity r which appeared in the
discussion of the case of the signal known except for carrier phase, Section 4.5. In fact,
each $r_m$ could be obtained in a receiver in the manner described in that section. The
quantity $r_0$ is connected with the first pulse; it could be obtained by designing an ideal
filter for the signal

$$
s_0(t) = f(t)\cos(\omega t + \theta) \tag{142}
$$

for any value of the phase angle $\theta$, and putting the output through a linear detector.
The output will be $(N_0/2)\, r_0/N$ at some instant of time $t_0$, which is determined by the time
delay of the filter. The other quantities $r_m$ differ only in that they are associated with
the pulses which come later. The output of the filter at time $t_0 + m\tau$ will be $(N_0/2)\, r_m/N$.

It is convenient to have the receiver calculate the logarithm of the likelihood
ratio,

$$
\ln l(r) = -\frac{E}{N_0} + \sum_{m=0}^{M-1} \ln I_0\!\left(\frac{r_m}{N}\right). \tag{143}
$$

Thus the $\ln I_0(r_m/N)$ must be found for each $r_m$, and these M quantities must be added.
As in the previous section, $r_m/N$ will usually be small enough so that $\ln I_0(r_m/N)$ can be
approximated by $r_m^2/4N^2$. The quantities $(r_m/N)^2$ can be found by using a square law
detector rather than a linear detector, and the outputs of the square law detector at
times $t_0, t_0 + \tau, \ldots, t_0 + (M - 1)\tau$ then must be added. The ideal system thus
consists of an IF amplifier with its passband matched to a single pulse,* a square law
detector (for the threshold signal case), and an integrating device.


We shall find normal approximations for the distribution functions of the
logarithm of the likelihood ratio using the approximation

$$
\ln I_0\!\left(\frac{r_m}{N}\right) \approx \frac{r_m^2}{4N^2}, \tag{144}
$$

which is valid for small values of r/N.† Substitution of (144) into (143) yields

$$
\ln l(r) \approx -\frac{E}{N_0} + \sum_{m=0}^{M-1} \frac{1}{4}\left(\frac{r_m}{N}\right)^2. \tag{145}
$$

The distributions for the quantities $r_m$ are independent; this follows from the fact that
the individual pulse functions $f(t + m\tau)\cos(\omega t + \theta_m)$ are orthogonal. The distribution
for each is the same as the distribution for the quantity r which appears in the
discussion of the signal known except for phase; the same analysis applies to both cases.
Thus, by Eq. (83),‡

$$
P_N\!\left(\sqrt{\frac{N_0 M}{2E}}\,\frac{r_m}{N} \ge a\right) = \exp\left[-\frac{a^2}{2}\right],
$$

or

$$
P_N\!\left(\frac{r_m}{N} \ge a\right) = \exp\left[-\frac{a^2 N_0 M}{4E}\right], \tag{146}
$$

and by (89),

$$
P_{SN}\!\left(\sqrt{\frac{N_0 M}{2E}}\,\frac{r_m}{N} \ge a\right) =
\exp\left[-\frac{E}{N_0 M}\right] \int_a^\infty x \exp\left[-\frac{x^2}{2}\right]
I_0\!\left(x\sqrt{\frac{2E}{N_0 M}}\right) dx, \tag{147}
$$

or

$$
P_{SN}\!\left(\frac{r_m}{N} \ge a\right) =
\frac{N_0 M}{2E}\exp\left[-\frac{E}{N_0 M}\right] \int_a^\infty x \exp\left[-\frac{x^2 N_0 M}{4E}\right] I_0(x)\, dx.
$$

The density functions can be obtained by differentiating (146) and (147):

$$
\begin{aligned}
g_N\!\left(\frac{r_m}{N}\right) &= \frac{M N_0}{2E}\,\frac{r_m}{N}
\exp\left[-\left(\frac{r_m}{N}\right)^2 \frac{N_0 M}{4E}\right], \\
g_{SN}\!\left(\frac{r_m}{N}\right) &= \frac{M N_0}{2E}\,\frac{r_m}{N}
\exp\left[-\frac{E}{N_0 M}\right]
\exp\left[-\left(\frac{r_m}{N}\right)^2 \frac{N_0 M}{4E}\right] I_0\!\left(\frac{r_m}{N}\right).
\end{aligned}
\tag{148}
$$

This is the same situation, mathematically, as appeared in the previous section. The
standard deviation and the mean for the logarithm of the likelihood ratio can be found
in the same manner, and they are

$$
\begin{aligned}
\mu_{SN}(\ln l) &= \frac{E^2}{M N_0^2}, \\
\mu_N(\ln l) &= 0, \\
\sigma_{SN}^2(\ln l) &= \frac{E^2}{M N_0^2}\left(1 + \frac{2E}{M N_0}\right), \\
\sigma_N^2(\ln l) &= \frac{E^2}{M N_0^2}.
\end{aligned}
\tag{149}
$$

If the distributions can be assumed normal, they are completely determined by
their means and variances. These formulas are identical with the formulas (133) of the
previous section. The problem is the same, mathematically, and the discussion and
receiver operating characteristic curves at the end of Section 4.7 apply to both cases.

* It is usually most convenient to make the ideal filter (or an approximation to it) a part
of the IF amplifier.
† See the footnote below equation (131).
‡ The M appears in the following equations because the energy of a single pulse is E/M
rather than E.
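
The moments quoted for the square law detector followed by an integrator can be checked by simulation. The sketch below is a hedged illustration, not the authors' procedure: it assumes that with noise alone each $r_m/N$ is Rayleigh distributed with $\sigma^2 = 2E/(M N_0)$, that the signal adds an offset of $\sigma^2$ to one quadrature component, and that ln l is computed from the small-signal approximation $-E/N_0 + \sum_m (r_m/N)^2/4$. The function name `simulate_log_lr`, the trial count, and the seed are arbitrary choices.

```python
import random
from statistics import fmean, pvariance


def simulate_log_lr(energy_ratio, m, signal_present, trials=20000, seed=1):
    """Draw samples of ln l ~ -E/N0 + sum_m (r_m/N)^2 / 4 for a train of m pulses."""
    rng = random.Random(seed)
    sigma = (2.0 * energy_ratio / m) ** 0.5      # assumed: sigma^2 = 2E / (M N0)
    offset = sigma if signal_present else 0.0    # assumed: signal shifts one quadrature by sigma^2
    samples = []
    for _ in range(trials):
        total = 0.0
        for _ in range(m):
            in_phase = sigma * (rng.gauss(0.0, 1.0) + offset)
            quadrature = sigma * rng.gauss(0.0, 1.0)
            total += (in_phase ** 2 + quadrature ** 2) / 4.0   # square law detector output
        samples.append(total - energy_ratio)                   # integrator output minus E/N0
    return samples


if __name__ == "__main__":
    e_over_n0, pulses = 4.0, 16
    predicted = e_over_n0 ** 2 / pulses                        # E^2 / (M N0^2)
    noise = simulate_log_lr(e_over_n0, pulses, signal_present=False)
    both = simulate_log_lr(e_over_n0, pulses, signal_present=True)
    print("noise alone:", fmean(noise), pvariance(noise), "predicted", 0.0, predicted)
    print("signal plus noise:", fmean(both), pvariance(both),
          "predicted", predicted, predicted * (1 + 2 * e_over_n0 / pulses))
```

With E/N0 = 4 and M = 16 the sample means and variances should fall close to 0 and 1 for noise alone, and close to 1 and 1.5 for signal plus noise, if the assumed sampling model is the right reading of Eqs. (146) and (147).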
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4.9 Approximate evaluation of an optimum receiver 


In order to obtain approximate results for the remaining two cases, the assump- 
tion is made that in these cases the receiver operating characteristic can be approxi- 
mated by the curves of Figs. 2 and 3, i.e., that the logarithm of the likelihood ratio is 
approximately normal. This section discusses the approximation and a method for 
fitting the receiver operating characteristic to the curves of Figs. 2 and 3. 

By (68), $F_{SN}(l)$ can be calculated if $F_N(l)$ is known. Furthermore, it can be seen
that the nth moment of the distribution $F_N(l)$ is the (n - 1)th moment of the distribution
$F_{SN}(l)$. Hence, the mean of the likelihood ratio with noise alone is unity, and if
the variance of the likelihood ratio with noise alone is $\sigma_N^2$, the second moment with
noise alone, and hence the mean with signal plus noise, is $1 + \sigma_N^2$. Thus the difference
between the means is equal to $\sigma_N^2$, which is the variance of the likelihood ratio with
noise alone. Probably this number characterizes ability to detect signals better than
any other single number.


Suppose the logarithm of the likelihood ratio has a normal distribution with
noise alone, i.e.,

$$
F_N(l) = \frac{1}{\sqrt{2\pi d}} \int_{\ln l}^{\infty} \exp\left[-\frac{(x - m)^2}{2d}\right] dx, \tag{150}
$$

where m is the mean and d the variance of the logarithm of the likelihood ratio. The
nth moment of the likelihood ratio can be found as follows:

$$
\mu_N(l^n) = \int_0^\infty l^n\, dF_N(l)
= \frac{1}{\sqrt{2\pi d}} \int_{-\infty}^{\infty} \exp[nx] \exp\left[-\frac{(x - m)^2}{2d}\right] dx, \tag{151}
$$

where the substitution $l = \exp x$ has been made. The integral can be evaluated by
completing the square in the exponent and using the fact that

$$
\int_{-\infty}^{\infty} \exp\left[-\frac{x^2}{2d}\right] dx = \sqrt{2\pi d}.
$$

Thus,

$$
\mu_N(l^n) = \exp\left[\frac{n^2 d}{2} + mn\right]. \tag{152}
$$

In particular, the mean of l(x), which must be unity, is

$$
\mu_N(l) = 1 = \exp\left[\frac{d}{2} + m\right], \tag{153}
$$

and therefore

$$
m = -\frac{d}{2}. \tag{154}
$$

The variance of l(x) with noise alone is $\sigma_N^2$, and therefore the second moment of l(x) is

$$
\mu_N(l^2) = [\mu_N(l)]^2 + \sigma_N^2(l) = 1 + \sigma_N^2, \tag{155}
$$

and this must agree with (152). It follows that

$$
\mu_N(l^2) = 1 + \sigma_N^2 = \exp[2d + 2m] = \exp[d], \tag{156}
$$

and therefore

$$
d = \ln(1 + \sigma_N^2). \tag{157}
$$

The distribution of likelihood ratio with signal plus noise can be found by
applying Eq. (68). Thus

$$
dF_{SN}(l) = l\, dF_N(l),
$$

$$
F_{SN}(l) = -\int_l^\infty l\, dF_N(l). \tag{158}
$$

If $dF_N(l)$ is obtained from Eq. (150) and l is replaced by $\exp x$, then

$$
F_{SN}(l) = \frac{1}{\sqrt{2\pi d}} \int_{\ln l}^{\infty} \exp[x] \exp\left[-\frac{(x + d/2)^2}{2d}\right] dx,
$$

or

$$
F_{SN}(l) = \frac{1}{\sqrt{2\pi d}} \int_{\ln l}^{\infty} \exp\left[-\frac{(x - d/2)^2}{2d}\right] dx. \tag{159}
$$

Thus the distribution of ln l is normal also when there is signal plus noise, in this case
with mean d/2 and variance d.


In summary, it is probable that the variance $\sigma_N^2$ of the likelihood ratio measures
ability to detect signals better than any other single number. If the logarithm of
likelihood ratio has a normal distribution with noise alone, then this distribution and
that with signal plus noise are completely determined if $\sigma_N^2$ is given. The distribution of
ln l(x) is normal in both cases. Its variance in both cases is d, which is also the difference
of the means. The receiver operating characteristic curves are those plotted in Fig. 2,
with the parameter d related to $\sigma_N^2$ by the equation

$$
d = \ln(1 + \sigma_N^2). \tag{160}
$$

In the case of a signal known exactly, this is the distribution which occurs. In
the cases of Section 4.6, Section 4.7, and Section 4.8 this distribution is found to be the
limiting distribution when the number of sample points is large. Certainly in most
cases the distribution has this general form. Thus it seems reasonable that useful
approximate results could be obtained by calculating only $\sigma_N^2$ for a given case and
assuming that the ability to detect signals is approximately the same as if the logarithm
of the likelihood ratio has a normal distribution. On this basis, $\sigma_N^2(l)$ is calculated
in the following sections for two cases, and the assertion is made that the receiver
operating characteristic curves are approximated by those of Fig. 2 with $d = \ln(1 + \sigma_N^2)$.

4.10 Signal which is one of M orthogonal signals 


Suppose that the set of expected signals includes just M functions $s_k(t)$, all of
which have the same probability, the same energy E, and are orthogonal. That is,

$$
\int_0^T s_k(t)\, s_q(t)\, dt = E\,\delta_{kq}. \tag{161}
$$

Then the likelihood ratio can be found from Eq. (56) to be

$$
l(x) = \frac{1}{M}\sum_{k=1}^{M} \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki}\right],
$$

or

$$
M\, l(x) = \sum_{k=1}^{M} \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right], \tag{162}
$$

where $s_{ki}$ are the sample values of the function $s_k(t)$.

With noise alone, each term of the form

$$
\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki}
$$

has a normal distribution with mean zero and variance $2E/N_0$.* Furthermore, the M
different quantities

$$
\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki}
$$

are independent, since the functions $s_k(t)$ are orthogonal. It follows that the terms

$$
\exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right]
$$

are independent.

Since the logarithm of each term

$$
z = \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right]
$$

has a normal distribution with mean $(-E/N_0)$ and variance $2E/N_0$, the moments of the
distribution can be found from Eq. (152). The nth moment is

$$
\mu_N(z^n) = \exp\left[n(n - 1)\frac{E}{N_0}\right]. \tag{163}
$$

It follows that the mean of each term is unity, and the variance is

$$
\sigma_N^2(z) = \mu_N(z^2) - [\mu_N(z)]^2 = \exp\left[\frac{2E}{N_0}\right] - 1. \tag{164}
$$

The variance of a sum of independent random variables is the sum of the variances of
the terms. Therefore

$$
\sigma_N^2(M l) = M\left[\exp\left(\frac{2E}{N_0}\right) - 1\right], \tag{165}
$$

and it follows that the variance of the likelihood ratio is

$$
\sigma_N^2(l) = \frac{1}{M}\left[\exp\left(\frac{2E}{N_0}\right) - 1\right]. \tag{166}
$$

It was pointed out in Section 4.9 that the receiver operating characteristic
curves are approximately those of Fig. 2, with

$$
d = \ln(1 + \sigma_N^2) = \ln\left[1 - \frac{1}{M} + \frac{1}{M}\exp\left(\frac{2E}{N_0}\right)\right]. \tag{167}
$$

This equation can be solved for $2E/N_0$:

$$
\frac{2E}{N_0} = \ln[1 + M(e^d - 1)]. \tag{168}
$$

* The reasoning is the same as that in Section 4.4.
* The reasoning is the same as that in Section 4.4. 
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Suppose it is desired to keep the false alarm probability and probability of detec- 
tion constant. This requires that d be kept constant. Then from Eq. (168) it can be 
seen that if the number of possible signals M is increased, the signal energy E must also 


be increased. 
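
The trade between the number of possible signals and the required energy can be tabulated directly. The sketch below assumes the relation between d, M, and 2E/N0 takes the form $d = \ln[1 + (e^{2E/N_0} - 1)/M]$ with inverse $2E/N_0 = \ln[1 + M(e^d - 1)]$; the function names are illustrative only.

```python
from math import expm1, log, log1p


def required_energy_ratio(m, d):
    """2E/N0 needed to hold the detection index at d with m orthogonal signals."""
    return log1p(m * expm1(d))


def detection_index(m, two_e_over_n0):
    """Detection index d obtained with m orthogonal signals at a given 2E/N0."""
    return log1p(expm1(two_e_over_n0) / m)


if __name__ == "__main__":
    d = 2.0
    for m in (1, 10, 100, 1000):
        exact = required_energy_ratio(m, d)
        large_m = log(m) + log(expm1(d))   # ln M + ln(e^d - 1)
        print(m, round(exact, 4), round(large_m, 4))
```

For M = 1 the required 2E/N0 equals d, and for large M it grows like ln M, so each tenfold increase in the number of alternatives costs roughly ln 10 in 2E/N0 at fixed d.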


4.11 Signal which is one of M orthogonal signals with unknown carrier phase 

Consider the case in which the set of expected signals includes just M different
amplitude-modulated signals which are known except for carrier phase. Denote the
signals by

$$
s_k(t) = f_k(t)\cos(\omega t + \theta). \tag{169}
$$

It will be assumed further that the functions $f_k(t)$ all have the same energy E and are
orthogonal, i.e.,

$$
\int_0^T f_k(t)\, f_q(t)\, dt = 2E\,\delta_{kq}, \tag{170}
$$

where the 2 is introduced because the f's are the signal amplitudes, not the actual signal
functions. Also, let the $f_k(t)$ be band-limited to contain no frequencies as high as $\omega$.
Then it follows that any two signal functions with different envelope functions will be
orthogonal. Let us assume also that the distribution of phase, $\theta$, is uniform, and that
the probability for each envelope function is 1/M.

With these assumptions, the likelihood ratio can be obtained from Eq. (56), and
it is given by

$$
l(x) = \frac{1}{M}\sum_{k=1}^{M} \frac{1}{2\pi}\int_0^{2\pi}
\exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right] d\theta, \tag{171}
$$

where $s_{ki}$ are the sample values of $s_k(t)$, and hence depend upon the phase $\theta$. The
integration is the same as in the case of the signal known except for phase, and the
result, obtained from Eq. (82), is

$$
l(x) = \frac{1}{M}\sum_{k=1}^{M} \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\frac{r_k}{N}\right), \tag{172}
$$

where

$$
r_k = \sqrt{\left(\sum_i x_i f_k(t_i)\cos\omega t_i\right)^2 + \left(\sum_i x_i f_k(t_i)\sin\omega t_i\right)^2}. \tag{173}
$$

Now the problem is to find $\sigma_N^2(l)$. The variance of each term in the sum in Eq.
(172) can be found since the distribution function with noise alone can be found in
Section 4.5. Since the $f_k(t)$ are orthogonal, the distributions of the $r_k$ are independent,
and the terms in the sum in Eq. (172) are independent. Then the variance of the likelihood
ratio, $\sigma_N^2(l)$, is the sum of the variances of the terms, divided by $M^2$.

The distribution function for each term $\exp(-E/N_0)\, I_0(r_k/N)$ is given in Section
4.5 by Eqs. (84) and (85). If $\alpha$ is defined by the equation

$$
\beta = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right), \tag{174}
$$

then the distribution function in the presence of noise for each term in Eq. (172) is

$$
F_N(\beta) = \exp\left[-\frac{\alpha^2}{2}\right]. \tag{175}
$$

The mean value of each term is

$$
\mu_N(\beta) = \int_0^\infty \beta\, dF_N(\beta)
= \exp\left[-\frac{E}{N_0}\right] \int_0^\infty \alpha\, I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right)
\exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \tag{176}
$$

This can be evaluated as on page 174 of Threshold Signals [5], and the result is that
$\mu_N(\beta) = 1$.

The second moment of each term is

$$
\mu_N(\beta^2) = \int_0^\infty \beta^2\, dF_N(\beta),
$$

or

$$
\mu_N(\beta^2) = \exp\left[-\frac{2E}{N_0}\right] \int_0^\infty \alpha
\left[I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right)\right]^2 \exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \tag{177}
$$

The integral can be evaluated as in Appendix E of Part II of reference [17], and the
result is

$$
\mu_N(\beta^2) = I_0\!\left(\frac{2E}{N_0}\right). \tag{178}
$$

The variance of each term in Eq. (172) is

$$
\sigma_N^2(\beta) = \mu_N(\beta^2) - [\mu_N(\beta)]^2 = I_0\!\left(\frac{2E}{N_0}\right) - 1. \tag{179}
$$

It follows that the variance of $M l$ is

$$
\sigma_N^2(M l) = M\left[I_0\!\left(\frac{2E}{N_0}\right) - 1\right], \tag{180}
$$

and therefore

$$
\sigma_N^2(l) = \frac{1}{M}\left[I_0\!\left(\frac{2E}{N_0}\right) - 1\right], \tag{181}
$$

since the variance for the sum of independent random variables is the sum of the
variances.

If the approximation described in Section 4.9 is used, the receiver operating
characteristic curves are approximately those of Fig. 2, with

$$
d = \ln(1 + \sigma_N^2) = \ln\left[1 - \frac{1}{M} + \frac{1}{M} I_0\!\left(\frac{2E}{N_0}\right)\right]. \tag{182}
$$
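
The cost of not knowing the carrier phase can be seen by comparing the detection index for known and unknown phase at the same 2E/N0. The sketch below assumes the two indices are $\ln[1 + (e^{x} - 1)/M]$ and $\ln[1 + (I_0(x) - 1)/M]$ with $x = 2E/N_0$; since the Python standard library has no Bessel functions, $I_0$ is summed from its power series. All function names are hypothetical.

```python
from math import exp, log


def bessel_i0(x):
    """Modified Bessel function I_0(x), summed from the series sum_k (x^2/4)^k / (k!)^2."""
    quarter_sq = 0.25 * x * x
    term, total, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= quarter_sq / (k * k)
        total += term
    return total


def detection_index_known_phase(m, two_e_over_n0):
    """Assumed form for m orthogonal signals known exactly."""
    return log(1.0 + (exp(two_e_over_n0) - 1.0) / m)


def detection_index_unknown_phase(m, two_e_over_n0):
    """Assumed form for m orthogonal signals known except for carrier phase."""
    return log(1.0 + (bessel_i0(two_e_over_n0) - 1.0) / m)


if __name__ == "__main__":
    for x in (2.0, 5.0, 10.0, 20.0):
        print(x, detection_index_known_phase(50, x), detection_index_unknown_phase(50, x))
```

Because $I_0(x) \le e^x$ for $x \ge 0$, the unknown-phase index never exceeds the known-phase index at the same energy.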


4.12 The broad band receiver and the optimum receiver 


A few applications of the results of Section 4 are suggested in Table I, Section
4.1. Two further examples of practical knowledge obtainable from the theory are
presented in this section and in the next.

One common method of detecting pulse signals in a frequency band of width
B is to build a receiver which covers this entire frequency band. Such a receiver with a
pulse signal of known starting time is studied in Section 4.7. This is not a truly optimum
receiver; it would be interesting to compare it with an optimum receiver. We
have been unable to find the distribution of likelihood ratio for the case of a signal
which is a pulse of unknown carrier phase if the frequency is distributed evenly over a
band. However, if the problem is changed slightly, so that the frequency is restricted
to points spaced approximately the reciprocal of the pulse width apart, then pulses
at different frequencies are approximately orthogonal, and the case of the signal which
is one of M orthogonal signals known except for phase can be applied. Eq. (182)
should be used with M equal to the ratio of the frequency band width B to the pulse
band width. Since the band width of a pulse is approximately the reciprocal of its pulse
width, the parameter M used in Section 4.7 also has this value. Curves showing
$2E/N_0$ as a function of d are given in Fig. 7 for both the approximate optimum receiver
and the broad band receiver for several values of M. In the figure, d is calculated from
Eq. (135) and Eq. (182), which hold for large values of M.

FIGURE 7. Comparison of optimum and broad band receivers.


4.13 Uncertainty and signal detectability 


In the two cases where the signal considered is one of M orthogonal signals, the
uncertainty of the signal is a function of M. This provides an opportunity to study the
effect of uncertainty on signal detectability. In the approximate evaluation of the optimum
receiver when the signal is one of M orthogonal functions, the ROC curves of
Figs. 2 and 3 are used with the detection index d given by

$$
d = \ln\left[1 - \frac{1}{M} + \frac{1}{M}\exp\left(\frac{2E}{N_0}\right)\right]. \tag{167}
$$

This equation can be solved for the signal energy, yielding

$$
\frac{2E}{N_0} = \ln[1 - M + M e^d] \approx \ln M + \ln(e^d - 1), \tag{175}
$$

the approximation holding for large $2E/N_0$.* From this equation it can be seen that
the signal energy is approximately a linear function of ln M when the detection index d,
and hence the ability to detect signals, is kept constant. It might be suspected that
$2E/N_0$ is a linear function of the entropy, $-\sum p_i \ln p_i$, where $p_i$ is the probability of the
ith signal. The linear relation holds only when all the $p_i$ are equal. The expression
which occurs in this more general case is:

$$
\frac{2E}{N_0} \approx -\ln\left(\sum p_i^2\right) + \ln(e^d - 1). \tag{176}
$$
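
A small numerical comparison makes the distinction between the entropy and the quantity $-\ln \sum p_i^2$ concrete. The sketch below is illustrative and rests on an assumption not spelled out in the text: that for unequal probabilities the variance of the likelihood ratio with noise alone is $\sum p_i^2\,[\exp(2E/N_0) - 1]$, so that $2E/N_0 = \ln[1 + (e^d - 1)/\sum p_i^2]$ exactly. Function names are hypothetical.

```python
from math import expm1, log, log1p


def sum_of_squares(priors):
    return sum(p * p for p in priors)


def entropy(priors):
    """Entropy -sum p_i ln p_i in natural units."""
    return -sum(p * log(p) for p in priors if p > 0.0)


def required_energy_ratio(priors, d):
    """2E/N0 for detection index d (assumed exact form, see lead-in)."""
    return log1p(expm1(d) / sum_of_squares(priors))


def approximate_energy_ratio(priors, d):
    """Large-energy approximation: -ln(sum p_i^2) + ln(e^d - 1)."""
    return -log(sum_of_squares(priors)) + log(expm1(d))


if __name__ == "__main__":
    equal = [0.125] * 8
    skewed = [0.7, 0.1, 0.1, 0.1]
    for priors in (equal, skewed):
        print(entropy(priors), -log(sum_of_squares(priors)),
              required_energy_ratio(priors, 2.0), approximate_energy_ratio(priors, 2.0))
```

For equal probabilities both measures reduce to ln M; for the skewed example they differ, which is why the energy is not a linear function of the entropy in general.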
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FOUNDATIONAL ASPECTS OF THEORIES OF MEASUREMENT! 


DANA SCOTT and PATRICK SUPPES 


1. Definition of measurement. It is a scientific platitude that there 
can be neither precise control nor prediction of phenomena without measure- 
ment. Disciplines as diverse as cosmology and social psychology provide 
evidence that it is nearly useless to have an exactly formulated quantitative 
theory if empirically feasible methods of measurement cannot be developed 
for a substantial portion of the quantitative concepts of the theory. Given 
a physical concept like that of mass or a psychological concept like that of 
habit strength, the point of a theory of measurement is to lay bare the struc- 
ture of a collection of empirical relations which may be used to measure the 
characteristic of empirical phenomena corresponding to the concept. Why 
a collection of relations? From an abstract standpoint a set of empirical 
data consists of a collection of relations between specified objects. For 
example, data on the relative weights of a set of physical objects are easily 
represented by an ordering relation on the set; additional data, and a 
fortiori an additional relation, are needed to yield a satisfactory quantitative 
measurement of the masses of the objects. 

The major source of difficulty in providing an adequate theory of measure- 
ment is to construct relations which have an exact and reasonable numerical 
interpretation and yet also have a technically practical empirical inter- 
pretation. The classical analyses of the measurement of mass, for instance, 
have the embarrassing consequence that the basic set of objects measured 
must be infinite. Here the relations postulated have acceptable numerical 
interpretations, but are utterly unsuitable empirically. Conversely, as we 
shall see in the last section of this paper, the structure of relations which 
have a sound empirical meaning often cannot be succinctly characterized 
so as to guarantee a desired numerical interpretation. 

Nevertheless this major source of difficulty will not here be carefully 
scrutinized in a variety of empirical contexts. The main point of the present 
paper is to show how foundational analyses of measurement may be grounded 
in the general theory of models, and to indicate the kind of problems relevant 


to measurement which may then be stated (and perhaps answered) in a 
precise manner. 
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Before turning to problems connected with construction of theories of 
measurement, we want to give a precise set-theoretical meaning to the 
notions involved. To begin with, we treat sets of empirical data as being 
(finitary) relational systems, that is to say, finite sequences of the form 
$\mathfrak{A} = \langle A, R_1, \ldots, R_n \rangle$, where A is a non-empty set of elements called the
domain of the relational system $\mathfrak{A}$, and $R_1, \ldots, R_n$ are finitary relations
on A. The relational system $\mathfrak{A}$ is called finite if the set A is finite; otherwise,
infinite. It should be obvious from this definition that we are mainly
considering qualitative empirical data. Intuitively we may think of each
particular relation $R_i$ (an $m_i$-ary relation, say) as representing a complete
set of "yes" or "no" answers to a question asked of every $m_i$-termed sequence
of objects in A. The point of this paper is not to consider that aspect
of measurement connected with the actual collection of data, but rather
the analysis of relational systems and their numerical interpretations.

If $s = \langle m_1, \ldots, m_n \rangle$ is an n-termed sequence of positive integers, then
a relational system $\mathfrak{A} = \langle A, R_1, \ldots, R_n \rangle$ is of type s if for each $i = 1, \ldots, n$
the relation $R_i$ is an $m_i$-ary relation. Two relational systems are similar
if there is a sequence s of positive integers such that they are both of type s.
Notice that the type of a relational system is uniquely determined only if
all the relations are non-empty; the avoiding of this ambiguity is not
worthwhile. Suppose that two relational systems $\mathfrak{A} = \langle A, R_1, \ldots, R_n \rangle$ and
$\mathfrak{B} = \langle B, S_1, \ldots, S_n \rangle$ are of type $s = \langle m_1, \ldots, m_n \rangle$. Then $\mathfrak{B}$ is a homomorphic
image of $\mathfrak{A}$ if there is a function f from A onto B such that, for each
$i = 1, \ldots, n$ and for each sequence $\langle a_1, \ldots, a_{m_i} \rangle$ of elements of A,
$R_i(a_1, \ldots, a_{m_i})$ if and only if $S_i(f(a_1), \ldots, f(a_{m_i}))$. If the function f is one-one,
then $\mathfrak{B}$ is an isomorphic image of $\mathfrak{A}$, or simply $\mathfrak{A}$ and $\mathfrak{B}$ are isomorphic.
$\mathfrak{A}$ is a subsystem of $\mathfrak{B}$ if $A \subseteq B$ and, for each $i = 1, \ldots, n$, the relation $R_i$
is the restriction of the relation $S_i$ to A. $\mathfrak{A}$ is imbeddable in $\mathfrak{B}$ if some subsystem
of $\mathfrak{B}$ is a homomorphic image of $\mathfrak{A}$.² A numerical relational system
is simply a relational system whose domain of elements is the set Re of all
real numbers. A numerical assignment for a relational system $\mathfrak{A}$ with respect
to a numerical relational system $\mathfrak{N}$ is a function which imbeds $\mathfrak{A}$ in $\mathfrak{N}$.
A numerical assignment is not required to be one-one.

Within the framework of the preceding formal definitions it is now
possible to give an exact characterization of a theory of measurement.
First of all the general outlines of a theory are determined by fixing a
finite sequence s of positive integers and only considering relational systems
of type s. Next a numerical relational system $\mathfrak{N}$ of type s is selected which
corresponds to the intended numerical interpretation of the theory, and
only relational systems imbeddable in $\mathfrak{N}$ are permitted. Moreover the
theory need not concern all relational systems of type s imbeddable in $\mathfrak{N}$
but only a distinguished subclass. Since it is reasonable that no special set of
objects be preferred, we require that the distinguished subclass be closed
under isomorphism. We thus arrive at the following characterization of
theories of measurement as definite entities: a theory of measurement is
a class K of relational systems closed under isomorphism for which there
exists a finite sequence s of positive integers and a numerical relational
system $\mathfrak{N}$ of type s such that all relational systems in K are of type s and
imbeddable in $\mathfrak{N}$.³

² Although in most mathematical contexts imbeddability is defined in terms of
isomorphism rather than homomorphism, for theories of measurement this is too
restrictive. However, the notion of homomorphism used here is actually closely
connected with isomorphic imbeddability and the facts are explained in detail in
Section 2.
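
For a finite relational system the definition of a numerical assignment can be checked mechanically: a map f from the domain into the real numbers imbeds the system if every relation holds of a sequence of objects exactly when the corresponding numerical relation holds of their images. The sketch below is only an illustration of that definition; the representation (relations as sets of tuples, numerical relations as Python predicates) and the name `is_numerical_assignment` are choices made here, not notation from the paper.

```python
from itertools import product


def is_numerical_assignment(domain, relations, numerical_relations, arities, f):
    """Check R_i(a_1..a_m) <=> S_i(f(a_1)..f(a_m)) for every i and every sequence."""
    for r_i, s_i, m_i in zip(relations, numerical_relations, arities):
        for seq in product(domain, repeat=m_i):
            holds_in_system = seq in r_i
            holds_in_numbers = bool(s_i(*(f[a] for a in seq)))
            if holds_in_system != holds_in_numbers:
                return False
    return True


if __name__ == "__main__":
    # A system of type <2>: "x is at most as heavy as y", with a and b tied.
    domain = ["a", "b", "c"]
    at_most = {("a", "a"), ("b", "b"), ("c", "c"),
               ("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")}
    weights = {"a": 1.0, "b": 1.0, "c": 2.5}   # not one-one: a and b share a value
    print(is_numerical_assignment(domain, [at_most], [lambda x, y: x <= y], [2], weights))
```

The example assignment gives the tied objects the same number, which is why the definition uses homomorphism rather than isomorphism and does not require a numerical assignment to be one-one.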

Some readers may object that the definition of theories of measurement 
should be linguistic rather than set-theoretical in character, since a theory 
is ordinarily thought of as a linguistic entity. To be sure, many theories 
of measurement have a natural formalization in first-order predicate logic 
with identity. Notice, however, that first-order axioms by themselves are 
not adequate, for if they admit one infinite relational system as a model 
then they have models of every infinite cardinality, and it is difficult to see 
how any natural connection can be established between numerical models 
and models of arbitrary cardinality. Even neglecting this criticism first- 
order axioms are not adequate to express properties involving arbitrary 
natural numbers, for example, that a relational system is finite or that as 
an ordering it has Archimedean properties. Any linguistic definition of 
theories which will permit expression of these more general properties would 
require extensive machinery and be immediately involved in some of the 
deepest problems of modern metamathematics. On the other hand, we do 
not wish to give the impression that we reject any linguistic questions. 
In fact, we use our set-theoretical definition as a point of departure for asking 
just such questions. 

On the basis of the definition of theories of measurement adopted, two 
questions naturally arise, to each of which we devote a section. In the 
first place, is a given class of relational systems a theory of measurement? 


And in the second place, given a theory of measurement, in what sense can 
it be axiomatized? 


2. Existence of measurement. A simple counterexample shows that 
not every class of relational systems of a given type closed under isomorphism 
is a theory of measurement. Let O be the class of all relational systems of
type $\langle 2 \rangle$ that are simple orderings. Let $\langle A, R \rangle$ be a system in O where R
well-orders A and A has a power not equal to or less than that of the continuum.
Such a relational system can be proved to exist even without the
help of the axiom of choice, but of course with aid of this axiom the existence
is obvious. By way of contradiction suppose that O is a theory of measurement
relative to a numerical relational system $\langle Re, S \rangle$. From the definition
it follows that $\langle A, R \rangle$ is imbeddable in $\langle Re, S \rangle$ and that there is a numerical
assignment f mapping A onto a subset of Re such that xRy if and only if
f(x) S f(y) for all elements $x, y \in A$. Let a, b be elements of A such that
f(a) = f(b). From the hypothesis that R is a simple ordering, we can assume
without loss of generality that aRb. Hence, we have f(a) S f(b), and then
f(b) S f(a), and finally bRa. R is antisymmetric, and so a = b. This argument
shows that the function f is one-one. Hence A has the same power as a
subset of Re, which is impossible. This proof shows that every theory
of measurement included in the class O contains only relational systems of
power at most that of the continuum. It is an unsolved problem of set-theory
closely connected with the continuum hypothesis whether the class O
restricted to systems of power at most that of the continuum is actually
a theory of measurement.⁴ At least it can be very easily shown that O so
restricted is not a theory of measurement relative to the system $\langle Re, \le \rangle$,
where the relation $\le$ is the usual ordering of the real numbers.⁵ Indeed,
the exact condition that a relational system in O must satisfy to be imbeddable
in $\langle Re, \le \rangle$ is not really elementary, and the proof of the necessity
involves the axiom of choice.⁶

³ In some contexts we shall say that the class K is a theory of measurement of type
s relative to $\mathfrak{N}$. Notice that a consequence of this definition is that, if K is a theory
of measurement, then so is every subclass of K closed under isomorphism. Moreover,
the class of all systems imbeddable in members of K is also a theory of measurement.

Let O′ be O restricted to countable relational systems.⁷ It was proved
by Cantor that O′ is a theory of measurement relative to $\langle Re, \le \rangle$, to formulate
somewhat irreverently his classical result in the terminology of this
paper. This restriction to countable relational systems is always sufficient. 
For it can be shown that the class of all countable relational systems of a 
given type is a theory of measurement; however, the numerical relational 
system required is so bizarre as to be of no practical value. 

A primary aim of measurement is to provide a means of convenient 
computation. Practical control or prediction of empirical phenomena 
requires that unified, widely applicable methods of analyzing the important 
relationships between the phenomena be developed. Imbedding the dis- 


⁴ In this connection see Sierpinski [5], Section 7, pp. 141 ff., in particular Proposition
C;5, where of course different terminology is used.
⁵ It is sufficient here to consider a relational system isomorphic to the ordering of
the ordinals of the second number class or to the lexicographical ordering of all pairs
of real numbers.
⁶ A simple ordering is imbeddable in $\langle Re, \le \rangle$ if and only if it contains a countable
dense subset. For the exact formulation and a sketch of a proof, see Birkhoff [1],
pp. 31-32, Theorem 2.
⁷ The word 'countable' means at most denumerable and it refers to the cardinality
of the domains of the relational systems.
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covered relations in various numerical relational systems is the most im- 
portant such unifying method that has yet been found. But among the 
morass of all possible numerical relational systems only a very few are of 
any computational value, indeed only those definable in terms of the 
ordinary arithmetical notions. From an empirical standpoint most sets 
of qualitative data can find numerical interpretation by relations defined 
in terms of addition and ordering alone. By way of example we may cite 
the measurement of masses, distances, sensation intensities, and subjective 
probabilities. Frequently the consideration of weighted averages requires 
also the use of the multiplication of numbers. However, in the examples 
given in this paper we shall restrict ourselves to the notions of addition and 
ordering. 

No natural scientific situation would seem strictly to require the con- 
sideration of sets of infinite data. This state of affairs suggests that theories 
of measurement containing only finite relational systems would suffice for 
empirical purposes. The problem is delicate, however, for the measurement 
of a meteorological quantity such as temperature by an automatic recording 
device is usually treated as continuous both in its own scale and in time. 
Yet the important problem of measurement does not really lie in the correct 
use of such recording devices but rather in their initial calibration, a process 
proceeding from a finite number of qualitative decisions. Because of the 
awkwardness of the uniform application of finite relational systems, we 
shall not generally make this restriction. 

Further remarks about establishing the existence of measurement are 
best motivated by reference to a concrete example. In a recent paper [4], 
Luce has introduced a generalization of simple orderings which he calls
semiorders. A semiorder is a relational system <A, P> of type <2> which
satisfies the following axioms for all x, y, z, w ∈ A:

S1. Not xPx.
S2. If xPy and zPw, then either xPw or zPy.
S3. If xPy and zPx, then either wPy or zPw.⁸


Such relations are most likely to occur in situations where objects are to 
be arranged in order and where it is difficult to say exactly when two 
objects are indifferent. For example, to say that xPy might be interpreted
as meaning that the pitch of the sound x is definitely higher than the pitch
of y, or that the hue of color x is definitely brighter than the hue of color y,
or that the weight of the object x is noticeably greater than that of y, etc.
Indifference between two objects x and y (in symbols: xIy) is defined as
not xPy, and not yPx. The point of Luce's axioms is that the relation I of
indifference is not always transitive, a fact easily appreciated for each of
the intuitive interpretations given above.

8 See [4], Section 2, p. 181. The axioms given here are actually a simplification
of those given by Luce.
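As a minimal sketch of these axioms, one can read "definitely greater" as "greater by more than one unit" on a handful of numbers, check S1 to S3 by brute force, and exhibit a failure of transitivity for the indifference relation I. The helper names `is_semiorder` and `indifferent` and the sample values are illustrative assumptions, not part of the paper.

```python
from itertools import product

def is_semiorder(A, P):
    """Brute-force check of axioms S1-S3 for a finite relation P on A.

    P is a set of ordered pairs (x, y), read as "x P y".
    """
    s1 = all((x, x) not in P for x in A)
    s2 = all((x, w) in P or (z, y) in P
             for x, y, z, w in product(A, repeat=4)
             if (x, y) in P and (z, w) in P)
    s3 = all((w, y) in P or (z, w) in P
             for x, y, z, w in product(A, repeat=4)
             if (x, y) in P and (z, x) in P)
    return s1 and s2 and s3

def indifferent(P, x, y):
    """x I y: neither x P y nor y P x."""
    return (x, y) not in P and (y, x) not in P

# Hypothetical data: x P y when x exceeds y by more than one unit.
values = [0.0, 0.6, 1.2, 2.5]
P = {(x, y) for x in values for y in values if x > y + 1}
```

With these values, 0.0 is indifferent to 0.6 and 0.6 is indifferent to 1.2, yet 1.2 stands in the relation P to 0.0, so I is not transitive.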

In his paper Luce gives a certain numerical interpretation for certain 
kinds of semiorders, but he does not show that any particular class of 
semiorders is a theory of measurement in the sense used here, because his 
interpretations are not relative to a fixed numerical relation. However, in 
the finite case the situation becomes relatively simple. Let ≫ be that
relation between real numbers defined by the condition: x ≫ y if and
only if x > y + 1. Clearly, if x and y are real numbers such that x ≫ y,
then it is fair to say that x is definitely greater than y, or better, x is noticeably
greater than y. It is in fact a simple exercise to prove that the relational
system <Re, ≫> is a semiorder. Further we shall give the proof of the
following result:

The class of finite semiorders is a theory of measurement relative to the
numerical relational system <Re, ≫>.

Before presenting the proof of the above, it would be well to outline a
general method in proofs of the existence of measurement which we shall
call the method of cosets. Let 𝔄 = <A, R_1, ..., R_n> be a relational system
of type <m_1, ..., m_n>. A uniquely determined equivalence relation E is
introduced into 𝔄 by the condition: xEy if and only if for each i = 1, ..., n
and each pair <z_1, ..., z_{m_i}>, <w_1, ..., w_{m_i}> of m_i-termed sequences of elements
of A, if z_j ≠ w_j implies {z_j, w_j} = {x, y} for j = 1, ..., m_i, then R_i(z_1, ..., z_{m_i})
if and only if R_i(w_1, ..., w_{m_i}).

Even though the above definition is complicated to state in general, the
meaning of the relation xEy is simple: elements x and y stand in the relation
E just when they are perfect substitutes for each other with respect to all
the relations R_i.⁹

The notion of a weak ordering can serve as an example. Let 𝔄 = <A, R>
where the binary relation R is connected and transitive. Then xEy is
equivalent to the condition: For all z ∈ A, xRz if and only if yRz, and zRx
if and only if zRy. However, this simplifies finally to: xRy and yRx.

Returning now to the general case, define, for each x ∈ A, [x] to be the
class of all y such that xEy. [x] is called the coset of x. Let A* be the class
of all [x] for x ∈ A. Directly from the definition of E we can deduce that it is
permissible to define m_i-ary relations R_i* over A* such that, for all x_1, ...,
x_{m_i} ∈ A, R_i*([x_1], ..., [x_{m_i}]) if and only if R_i(x_1, ..., x_{m_i}). The relational
system 𝔄* = <A*, R_1*, ..., R_n*> is called the reduction of 𝔄 by cosets.
It is at once obvious that 𝔄* is a homomorphic image of 𝔄 and that 𝔄** is
isomorphic with 𝔄*. What is not quite obvious is the following:

If 𝔅 is a homomorphic image of 𝔄, then 𝔄* is a homomorphic image of 𝔅.

9 The authors are indebted to the referee for pointing out the work by Hailperin 
in [3] which suggested this general definition. 
By way of proof, let f be a homomorphism of 𝔄 onto 𝔅. We wish to
show that if f(x) = f(y), then [x] = [y]. Instead of the general case, assume
for simplicity that 𝔄 and 𝔅 are of type <2> and 𝔄 = <A, R>, 𝔅 = <B, S>.
We must show that if f(x) = f(y), then xEy, or in other words, for all z ∈ A,
xRz if and only if yRz, and zRx if and only if zRy. Assume xRz. It follows
that f(x) S f(z), and hence f(y) S f(z), which implies that yRz. The argument
is clearly symmetric. We have therefore shown that there is a function g
from B onto A* such that g(f(x)) = [x] for x ∈ A. It is trivial to verify that
g is a homomorphism of 𝔅 onto 𝔄*.

Notice the following relation between the concepts of homomorphic 
image and subsystem: if 𝔅 is a homomorphic image of 𝔄, then 𝔅 is isomorphic
to a subsystem of 𝔄. For let f be a homomorphism of 𝔄 onto 𝔅. Let g
be any function from B into A such that f(g(y)) = y for all y ∈ B. The
restriction of 𝔄 to the range of g yields the subsystem of 𝔄 isomorphic
with 𝔅.

Using the above remarks we can establish at once the equivalence: 
𝔄 is imbeddable in 𝔅 if and only if 𝔄* is imbeddable in 𝔅.

Further, it follows that any function imbedding 𝔄* in 𝔅 is always an
isomorphism of 𝔄* onto a subsystem of 𝔅, and of all homomorphic images
of 𝔄 this property is characteristic of 𝔄*.

Let K now be any class of relational systems closed under isomorphism. 
Let K* be the class of all systems isomorphic to some 𝔄* for 𝔄 ∈ K. In effect
we have shown above:

(i) K is a theory of measurement relative to a numerical relational system
𝔑 if and only if K* is also.

(ii) If K in addition is closed under the formation of subsystems, then K*
is the class of all systems in K possessing only one-one numerical assignments.

To use our example again, if K is the class of weak orders, then K* is 
the class of simple orders. Notice that the proof in the first paragraph 
of this section is a special case of (ii). 
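The reduction by cosets can be sketched for the binary case as follows. The function names `equivalent` and `reduce_by_cosets` and the scored items are hypothetical; the test for E uses the single-position substitutions stated above for systems of type <2>.

```python
def equivalent(A, R, x, y):
    """x E y for a binary relation R: x and y are perfect substitutes in R."""
    return all(((x, z) in R) == ((y, z) in R) and
               ((z, x) in R) == ((z, y) in R) for z in A)

def reduce_by_cosets(A, R):
    """Return (A*, R*, coset), where coset[x] is the coset [x] as a frozenset."""
    coset = {x: frozenset(y for y in A if equivalent(A, R, x, y)) for x in A}
    A_star = set(coset.values())
    # R*([x], [y]) holds exactly when R(x, y) holds.
    R_star = {(coset[x], coset[y]) for (x, y) in R}
    return A_star, R_star, coset

# Hypothetical weak ordering: b and c are tied, a is below both.
score = {"a": 1, "b": 2, "c": 2}
A = set(score)
R = {(x, y) for x in A for y in A if score[x] >= score[y]}
A_star, R_star, coset = reduce_by_cosets(A, R)
```

Here the tied elements b and c fall into one coset, and the reduced system on the two cosets is a simple ordering, in line with the remark that K* is the class of simple orders when K is the class of weak orders.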

It should be remarked that for a relational system 𝔄, 𝔄 and 𝔄* always
satisfy exactly the same formulas of first-order logic not involving the
notion of identity. Hence, if K is the class of all relational systems satisfying
first-order axioms without identity, then K* is the class of all systems
satisfying the axioms for K and in addition satisfying the axiom:

(*) If xEy, then x = y.

The application of this remark to weak orderings and simple orderings
is left to the reader.

Consider again the case of semiorders. Let S be the class of all finite
semiorders. For any <A, P> ∈ S, consider the relation I of indifference
defined above. In terms of I one can establish a simplified characterization
of E: xEy if and only if for all z ∈ A, xIz if and only if yIz.

Introduce (*) as a new axiom S4. The class of all 𝔄 ∈ S satisfying S4
is just the class S*. Notice that unlike the pleasant situation with weak
orderings and simple orderings, the class S* is not closed under the formation
of subsystems even though S is.

For any semiorder <A, P> introduce a further relation R as follows:
xRy if and only if for all z, if zPx then zPy, and if yPz then xPz.

We leave to the reader the elementary verification of the fact that R
is a weak ordering of A, and that xEy if and only if xRy and yRx. Thus,
if <A, P> ∈ S*, then R is a simple ordering of A. The connection between
P and R is clearer if one notices that xPy implies xRy, and that, if xRx_1,
x_1Py_1, and y_1Ry, then xPy.

Now let 𝔄 = <A, P> be a fixed member of S*. We wish to show that 𝔄
has an assignment in <Re, ≫>. Under the relation R, A is simply ordered.
Let A = {x_0, ..., x_n} where x_i R x_{i−1} and x_i ≠ x_{i−1}. Define by a course of
values recursion a sequence a_0, ..., a_n of rational numbers determined
uniquely by the following two conditions:

(1) If x_i I x_0, then a_i = i/(i+1).

(2) If x_i I x_j and x_i P x_{j−1} where j > 0, then
    a_i = (i/(i+1)) a_j + (1/(i+1)) a_{j−1} + 1.

Notice that in (2) the hypothesis implies that j ≤ i, while in the case
j = i the formula for a_i simplifies to a_i = a_{i−1} + i + 1. Notice further
that every element x_i comes either under (1) or (2); for letting x_j be the
first element such that x_j I x_i there are two cases: j = 0, j > 0. Clearly
we always have a_i ≥ 0.


We show first that a_i > a_{i−1} by induction on i. For case (1), this is obvious.
Passing to (2), assume that x_i I x_j and x_i P x_{j−1}. If x_{i−1} I x_0 then a_{i−1} < 1
while a_i > 1. Hence we can assume not x_{i−1} I x_0, or in other words x_{i−1} P x_0.
Let x_k be the first element such that x_{i−1} I x_k and x_{i−1} P x_{k−1}. By definition

    a_{i−1} = ((i−1)/i) a_k + (1/i) a_{k−1} + 1.

If j = i, there is no problem. Assume then
that j < i. Now x_{i−1} R x_j, x_i R x_{i−1}, and x_i I x_j, hence x_{i−1} I x_j, and so by our
choice of k we have k ≤ j. By the induction hypothesis on i, it follows
that a_j > a_{j−1} and a_k > a_{k−1}. If k = j, the required inequality is obvious.
If k ≤ j−1, then a_i > a_{j−1} + 1. Similarly a_{i−1} < a_k + 1, but again, by
the induction hypothesis, a_{j−1} ≥ a_k and hence a_i > a_{i−1}.

The next step is to prove that, if x_i P x_k, then a_i > a_k + 1. Let x_j be the
first element such that x_i I x_j and x_i P x_{j−1}. We have j−1 ≥ k, and, in view
of the preceding argument, a_{j−1} ≥ a_k. But a_{j−1} + 1 < a_i, whence a_i > a_k + 1.

Conversely we must show that, if a_i > a_k + 1, then x_i P x_k. The hypothesis
of course implies i > k. Assume by way of contradiction that not x_i P x_k.
It follows that x_i I x_k. Let x_j be the first element such that x_i I x_j; then
k ≥ j and a_k ≥ a_j. If j = 0, then x_i I x_0 and x_k I x_0, because x_i R x_k. But
then 0 ≤ a_i < 1 and 0 ≤ a_k < 1, which contradicts the inequality
a_i > a_k + 1. We can conclude that j > 0. Now a_i < a_j + 1, but a_k ≥ a_j,
and thus a_i < a_k + 1, which again is a contradiction. All cases have been
covered, and the argument is complete.

Finally define a function f on A such that f(x_i) = a_i. We have actually
shown that f imbeds 𝔄 in <Re, ≫>. Thus it has been proved that S* is a
theory of measurement relative to <Re, ≫>, and, by the general remarks
on the method of cosets, we conclude that S is also a theory of measurement
relative to <Re, ≫>.

Notice that the above proof would also work in the infinite case as long
as the ordering R is a well-ordering of type ω.
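The recursion of conditions (1) and (2) can be sketched in code as follows. This is a minimal illustration under stated assumptions: the input is taken to be a finite semiorder already reduced by cosets (a member of S*), the function name `semiorder_assignment` and the four-element example are hypothetical, and exact rationals are used so that the final comparison is free of rounding.

```python
from fractions import Fraction

def semiorder_assignment(A, P):
    """Sketch of the recursive assignment a_0, ..., a_n for a reduced finite semiorder.

    A: iterable of elements; P: set of pairs (x, y), read as "x P y".
    Assumes <A, P> satisfies S1-S4, so that R below is a simple ordering.
    """
    A = list(A)

    def R(x, y):
        # x R y iff for all z: (z P x implies z P y) and (y P z implies x P z)
        return all(((z, x) not in P or (z, y) in P) and
                   ((y, z) not in P or (x, z) in P) for z in A)

    def I(x, y):
        # indifference: neither x P y nor y P x
        return (x, y) not in P and (y, x) not in P

    # In a simple ordering the number of y with x R y is the position of x.
    rank = {x: sum(R(x, y) for y in A) for x in A}
    if sorted(rank.values()) != list(range(1, len(A) + 1)):
        raise ValueError("R is not a simple ordering; reduce by cosets first")
    xs = sorted(A, key=rank.get)          # x_0, ..., x_n in increasing R-order

    a = []
    for i, xi in enumerate(xs):
        # j is the first index with x_i I x_j (j <= i because x_i I x_i)
        j = next(k for k in range(i + 1) if I(xs[k], xi))
        if j == 0:                         # condition (1)
            a.append(Fraction(i, i + 1))
        elif j == i:                       # condition (2) with j = i
            a.append(a[i - 1] + i + 1)
        else:                              # condition (2) with j < i
            a.append(Fraction(i, i + 1) * a[j]
                     + Fraction(1, i + 1) * a[j - 1] + 1)
    return dict(zip(xs, a))

# Hypothetical four-element example.
A = ["a", "b", "c", "d"]
P = {("c", "a"), ("d", "a"), ("d", "b"), ("d", "c")}
f = semiorder_assignment(A, P)
```

For this example the values come out as 0, 1/2, 4/3 and 16/3, and x P y holds exactly when f(x) > f(y) + 1. A system that is not reduced should first be passed through the reduction by cosets, since otherwise R has ties and the sketch raises an error.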

Let us now summarize the steps in establishing the existence of measure- 
ment using as examples simple orderings and semiorders. First, after one 
is given a class, K say, of relational systems, the numerical relational 
system should be decided upon. The numerical relational system should be 
suggested naturally by the structure of the systems in K, and as was re- 
marked, it is most practical to consider numerical systems where all the 
relations can be simply defined in terms of addition and ordering of real 
numbers. Second, if the proof that K is a theory of measurement is not at
once obvious, the cardinality of systems in K should be taken into consideration.
The restriction to countable systems would always seem empirically
justified, and adequate results are possible with a restriction to finite
systems. Third, the proof of the existence of measurement can often be 
simplified by the reduction of each relational system in K by the method 
of cosets. Then, instead of trying to find numerical assignments for each 
member of K, one concentrates only on the reduced systems. This plan was 
helpful in the case of semiorders. Instead of cosets, it is sometimes feasible 
to consider imbedding by subsystems. That is to say, one considers some 
convenient subclass K' ⊆ K such that every element of K is a subsystem
of some system in K'. If K' is a theory of measurement, then so is K. In
the case of semiorders we could have used either plan: cosets or subsystems.

After the existence of measurement has been established, there is one 
question which is often of interest: For a given relational system, what
is the class of all its numerical assignments? We present an example. 

Consider relational systems 𝔄 = <A, D> of type <4>. For such systems
we introduce the following definitions: xRy if and only if xyDyy. xyM^1zw
if and only if xyDzw, zwDxy, yRz and zRy. xyM^{n+1}zw if and only if there
exist u, v ∈ A such that xyM^n uv and uvM^1 zw.

Let H be the class of all such relational systems which satisfy the following
axioms for every x, y, z, u, v, w ∈ A:

A1. If xyDzw and zwDuv, then xyDuv.

A2. xyDzw or zwDxy.

A3. If xyDzw, then xzDyw.

A4. If xyDzw, then wzDyx.

A5. If xRy and yzDuv, then xzDuv.

A6. There is a z ∈ A such that xzDzy and zyDxy.

A7. If not xyDzw and not xRy, then there is a u ∈ A such that zwDxu,
not xRu, and not uRy.

A8. If xyDzw and not xRy, then there are u, v ∈ A and an n such that
zuM^n vw and zuDxy.


These axioms imply that for a system 𝔄 in H, the relation R is a weak
ordering of A, and the intuitive interpretation of xyDzw in case yRx and
wRz is that the interval between x and y is not greater than the interval
between z and w. Making heavy use of the last three existence axioms, it
can be shown that H is a theory of measurement relative to the numerical
relational system <Re, Δ> where Δ is the quaternary relation defined by
the condition xyΔzw if and only if x − y ≤ z − w for all x, y, z, w ∈ Re. It
must be stressed that the Archimedean property of the ordering embodied
in A8 cannot be formulated in first-order logic, because it implies that all
systems in H* have cardinality not more than the power of the continuum.
In addition, it can be shown that, if 𝔄 is in H, and f and g are two numerical
assignments of 𝔄 relative to <Re, Δ>, then f and g are related by a positive
linear transformation;¹⁰ that is, there exist α, β ∈ Re with α > 0 such
that, for all x ∈ A, f(x) = αg(x) + β. This gives in a certain sense the answer
to the question above: If we know one numerical assignment for 𝔄, we
know them all. Except for very special systems in H, nothing more specific
can really be expected.

Notice that all relational systems in H are necessarily infinite. In the 
next section we shall consider in detail the theory of measurement F 
consisting of all finite relational systems imbeddable in <Re, Δ>. Here the
situation is quite hopeless. There simply is no apparent general statement
that can be made about the relation between assignments. Inasmuch as
any function φ which imbeds <Re, Δ> in itself is necessarily a linear transformation
and conversely, it follows that, if 𝔄 is a system in F and f is an
assignment for 𝔄, then f composed with a linear transformation is also an
assignment. The main difficulty with F is that two assignments for the 
same system in F need not be related by a linear transformation. 
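A small numerical illustration of this last point, under assumptions of our own: the three-element domain and the two assignments f and g below are hypothetical, and `induced_relation` is an illustrative helper that builds the quaternary relation an assignment induces through x − y ≤ z − w.

```python
from itertools import product

def induced_relation(f):
    """Quadruples (x, y, z, w) with f(x) - f(y) <= f(z) - f(w)."""
    A = list(f)
    return {(x, y, z, w) for x, y, z, w in product(A, repeat=4)
            if f[x] - f[y] <= f[z] - f[w]}

# Two hypothetical assignments on the same three objects.
f = {"a": 0, "b": 1, "c": 3}
g = {"a": 0, "b": 1, "c": 4}

# Both induce the same finite relational system of type <4>.
same_system = induced_relation(f) == induced_relation(g)

# A linear transformation g = alpha * f + beta is fixed by the values at a and b.
alpha = (g["b"] - g["a"]) / (f["b"] - f["a"])
beta = g["a"] - alpha * f["a"]
linearly_related = all(g[x] == alpha * f[x] + beta for x in f)
```

Both assignments order all nine differences in the same way, so they are assignments for one and the same finite system, yet the transformation forced by the values at a and b (α = 1, β = 0) fails at c.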


10 The proofs of both these facts about H are very similar to the corresponding
proofs in Suppes and Winet [6].

3. Axiomatizability. Given a theory of measurement, it is natural
to ask various questions about its axiomatizability, for the axiomatic
analysis of any mathematical theory usually throws considerable light
on the structure of the theory. In particular, given an extrinsic characterization
of a theory of measurement via a particular numerical relational
system, it is quite desirable to have an intrinsic axiomatic characterization 
of the theory to be able better to recognize when a relational system actually 
belongs to the theory. In view of the paucity of metamathematical results 
concerning the axiomatics of higher-order theories, we shall restrict our- 
selves to the problem of axiomatizing theories of measurement in first- 
order logic. 

It is a well-known result that, if a set of first-order axioms has one 
infinite model, then it has models of unbounded cardinalities. Since for 
the most part we are interested in one-one assignments with values in the 
set of real numbers, unbounded cardinalities are hardly an asset. That is to 
say, the class of all relational systems that are models of a given set of 
first-order axioms is usually not a theory of measurement. To remove such 
difficulties without having to understand them, we simply restrict the 
cardinalities under consideration. Even a restriction to finite cardinalities 
is not too strong and leads to some rather difficult questions. Thus for the 
remainder of this section we shall consider only finitary theories of measurement,
i.e., theories containing only finite relational systems. Such a theory
is called axiomatizable, if there exists a set of sentences of first-order logic 
(the axioms of the theory) such that a finite relational system is in the 
theory if and only if the system satisfies all the sentences in the set. A 
theory is finitely axiomatizable if it has a finite set of axioms. A theory 
is universally axiomatizable if it has a set of axioms each of which is a uni- 
versal sentence (i.e., a sentence in prenex normal form with only universal 
quantifiers). 

It should be observed, first, that any finitary theory of measurement 
is axiomatizable. This is no deeper than saying that in first-order logic we 
can write down a sentence completely describing the isomorphism type of 
each finite relational system not in the given theory, and clearly the negations
of these sentences can serve as the required set of axioms. It is of
course quite obvious that we cannot in each instance give an effective method
for writing down the axioms, since there are clearly a continuum number
of distinct finitary theories of measurement. Notice also that if the theory is
closed under subsystems then the axioms may be taken as universal sentences,
and conversely. In case one considers theories consisting of all finite
relational systems imbeddable in a given numerical relational system, then 
the problem of a recursive or effective axiomatization is simply the problem 
of whether the class of universal sentences true in the given numerical 
relational system is recursively enumerable or not. It is not difficult to 
establish that this last problem is equivalent to the problem of giving a
recursive enumeration of all the relation types of finite relational systems 
not imbeddable in the given numerical relational system. For numerical 
relational systems whose relations are definable in first-order logic in terms 
of + and <, these problems do not arise since the first-order theory of
+ and < is decidable, and it is to these relational systems that we shall 
primarily restrict our further attention. 

In the second place, in all domains of mathematics a finite axiomatization
of a theory is usually felt to be the most satisfactory result. No doubt the
psychological basis for such a feeling rests on the fact that only a finite 
characterization can in one step explicitly lay bare the full structure of a 
theory. Of course an extremely complicated axiomatization may be of little 
practical value, and as regards theories of measurement there is a further 
complication. Namely, if an axiomatization in first-order logic, no matter 
how elegant it may be, involves a combination of several universal and 
existential quantifiers, then the confirmation of this axiom may be highly 
contingent on the relatively arbitrary selection of the particular domain 
of objects. From the empirical standpoint, aside from the possible require- 
ment of a fixed minimal number of objects, results ought to be independent 
of an exact specification of the extent of the domain. 

We are thus brought to our third observation: A finite universal axio- 
matization of a theory of measurement always yields a characterization 
independent of accidental object selection. To be precise, consider a fixed 
universal sentence. This formula will obviously contain just a finite number 
of variables. Hence, to verify the truth of the sentence in a particular 
relational system, we need consider only subsets of the domain of a uniformly 
bounded cardinality. Furthermore, verification for each subset is completely 
independent of any relationships with the complementary set. 

Simple orderings and semiorders are examples of this last point. To
determine whether a finite relational system of type <2> is a simple ordering,
one has only to consider triples of objects; for semiorders, quadruples. In
constructing an experiment, say, on the simple ranking of objects with
respect to a certain property, the design is ordinarily such that connectivity
and antisymmetry of the relation are satisfied, because for each pair of
objects the subject is required to decide the ranking one way or the other,
but not in both directions. Analysis of the data then reduces to searching
for "intransitive triads".
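The search for intransitive triads can be sketched as a check over all triples of distinct objects. The function name `intransitive_triads` and the forced-choice data are hypothetical; the point is only that verification never looks beyond three objects at a time.

```python
from itertools import permutations

def intransitive_triads(A, R):
    """Return the triples (x, y, z) with x R y and y R z but not x R z."""
    return [(x, y, z) for x, y, z in permutations(A, 3)
            if (x, y) in R and (y, z) in R and (x, z) not in R]

items = ["p", "q", "r", "s"]
# Hypothetical forced-choice data: every pair is decided exactly one way.
prefers = {("p", "q"), ("q", "r"), ("r", "p"),   # a cycle among p, q, r
           ("p", "s"), ("q", "s"), ("r", "s")}   # s is ranked last throughout
triads = intransitive_triads(items, prefers)
```

In this example the three rotations of the cycle p, q, r are reported, and no triple involving s is, so the data fail to be a simple ordering only because of that one cycle.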

Vaught [8] has provided a useful criterion for certain classes of relational
systems to be axiomatizable by means of a universal sentence. A straightforward
analysis of his proof yields immediately the following criterion for
finitary theories of measurement.

A finitary theory of measurement K is axiomatizable by a universal sentence,
if and only if K is closed under subsystems and there is an integer n such that,
if any finite relational system 𝔄 has the property that every subsystem of 𝔄
with no more than n elements is in K, then 𝔄 is in K.

Though classes of finite simple orderings and finite semiorders are two
examples of finitary theories of measurement axiomatizable by a universal
sentence, there are interesting examples of finitary theories of measurement
closed under subsystems which are not axiomatizable by a universal sentence.
We now turn to the proof for one such case.

Let F be the class of all finitary relational systems of type <4> imbeddable
in the numerical relational system <Re, Δ>. A wide variety of sets of
empirical data are in F. In fact, all sets of psychological data based upon
judgments of differences of sensation intensities or of differences in utility
qualify as candidates for membership in F. For example, in an experiment
concerned with the subjective measurement of loudness of n sounds, the
appropriate empirical data would be obtained by asking subjects to compare 
each of the n sounds with every other and then to compare the difference of 
loudness in every pair of sounds with every other. More elaborate inter- 
pretations are required to obtain appropriate data on utility differences for 
individuals or social groups (cf. Davidson, Suppes and Siegel [2], Suppes and 
Winet [6]). It may be of some interest to mention one probabilistic inter- 
pretation closely related to the classical scaling method of paired compari- 
sons. Subjects are asked to choose only between objects, but they are asked 
to make this choice a number of times. There are many situations in which 
they vacillate in their choice, and the probability p_xy that x will be chosen
over y may be estimated from the relative frequency with which x is so
chosen. From inequalities of the form p_xy ≤ p_zw we may obtain a set of
empirical data, that is, a finite relational system of type <4>, which is a
candidate for membership in F. The intended interpretation is that, if
p_xy ≥ 1/2 and p_zw ≥ 1/2, then p_xy ≤ p_zw if and only if the difference in sensation
intensity or difference in utility between x and y is equal to or less
than that between z and w, the idea being, of course, that if x and y are
closer together than z and w in the subjective scale, then the relative
frequency of choice of x over y is closer to one-half than that of z over w.
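A minimal sketch of how such data might be turned into a quaternary relation follows. The choice counts, the helper name `difference_relation`, and the convention that an object compared with itself has probability one-half are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def difference_relation(items, p):
    """Quadruples (x, y, z, w) with p[x, y] <= p[z, w]."""
    return {(x, y, z, w) for x, y, z, w in product(items, repeat=4)
            if p[x, y] <= p[z, w]}

items = ["a", "b", "c"]
# Hypothetical counts: wins[x, y] = times x was chosen over y in 10 trials.
wins = {("a", "b"): 6, ("b", "c"): 7, ("a", "c"): 9}

p = {}
for (x, y), k in wins.items():
    p[x, y] = Fraction(k, 10)      # estimated probability that x is chosen over y
    p[y, x] = 1 - p[x, y]
for x in items:
    p[x, x] = Fraction(1, 2)       # convention assumed here for identical objects

D = difference_relation(items, p)
```

With these counts the quadruple (a, b, b, c) belongs to D, reflecting the estimate that a and b are closer together than b and c, while (a, c, a, b) does not. Whether the resulting system is actually in F is a separate question that this sketch does not address.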

Before formally proving that the theory of measurement F is not axio- 
matizable by a universal sentence, we intuitively indicate for a relational 
system of ten elements the kind of difficulty which arises in any attempt 
to axiomatize F. Let the ten elements be a_1, ..., a_10 ordered as shown
on the following diagram with atomic intervals given the designations
indicated.

   | α_1 | α_2 | α_3 | α_4 |  γ  | β_1 | β_2 | β_3 | β_4 |
  a_1   a_2   a_3   a_4   a_5   a_6   a_7   a_8   a_9   a_10

Let α be the interval (a_1, a_5), let β be the interval (a_6, a_10), and let γ be
larger than α or β. We suppose further that α_1, α_2, α_3, α_4 is equal in size to
Ba, Bs, Br, Bs, respectively, but « is less than B." 


The size relationships among the remaining intervals may be so chosen
that any subsystem of nine elements is imbeddable in <Re, Δ>, whereas
the full system of ten elements is clearly not.

11 Essentially this example was first given in another context by Herman Rubin
to show that a particular set of axioms is defective.

Generalizing this example and using the criterion derived from Vaught's
theorem we now prove:

THEOREM. The theory of measurement F is not axiomatizable by a universal
sentence.

PROOF. In order to apply the criterion of axiomatizability by a universal
sentence, we need to show that for every n there is a finite relational system
𝔄 of type <4> such that every subsystem of 𝔄 with n elements in its domain
is in F but 𝔄 is not.

To this end, for every even integer n = 2m ≥ 10 we construct a finite
relational system 𝔄 of type <4> such that every subsystem of 2m−1 elements
is in F. (A fortiori every subsystem of 2m−k elements for k < 2m
is in F.) To make the construction both definite and compact, we take
numbers as elements of the domain and disrupt exactly one numerical
relationship. Let now n be an even integer equal to or greater than 10.
The selection of numbers a_1, ..., a_2m may be most easily described by
specifying the numerical size of the atomic intervals. We define α_i =
a_{i+1} − a_i for i = 1, ..., m−1 and β_i = a_{m+i+1} − a_{m+i} for i = 1, ..., m−1.
We then set a, =, «4; = 2: for i= 1, ...,m—l, and an = 22m, In 
fixing the size of β_i, we have two cases to consider depending on the parity
of m.

CASE 1. m is even. Then m−1 is odd, and we set β_i = α_{i/2} for i = 2,
4, ..., m−2 and β_i = α_{(m+i−1)/2} for i = 1, 3, ..., m−1.

CASE 2. m is odd. Then m−1 is even, and we set β_i = α_{i/2} for i = 2,
4, ..., m−1 and β_i = α_{(m+i)/2} for i = 1, 3, ..., m−2. Thus if n = 2m = 12,
we have α_1 = β_2, α_2 = β_4, α_3 = β_1, α_4 = β_3, α_5 = β_5. With the set A =
{a_1, ..., a_2m} defined, we now define the relation D as the expected numerical
relation except for permutations of <a_1, a_m, a_{m+1}, a_2m>. If
x, y, z, w ∈ A and <x, y, z, w> is not some permutation of <a_1, a_m, a_{m+1}, a_2m>,
then <x, y, z, w> ∈ D if and only if

(1)    x − y ≤ z − w.

Moreover, let a = a_1, b = a_m, c = a_{m+1}, d = a_2m. Then we put the following
nine permutations of <a, b, c, d> in D:


&B, 2, d, 0 <a, b,d,c) <c,b,d,a 
(b,d,a,0 Sz, 6.0, bs 6,40,00১ 
@0,d,c, a <a, d, c, b> <c,d,b,a) 


(These nine permutations correspond exactly to the strict inequalities
following from b − a < d − c. All nine are needed to make the subsystems
of <A, D> have the appropriate properties.)
From the choice of the numbers in A and the definition of D it is obvious
that <A, D> is not imbeddable in <Re, Δ>, that is, that <A, D> is not in F;
for the atomic intervals between a_1 and a_m must add up to a length equal
to the sum of the atomic intervals between a_{m+1} and a_2m, but by hypothesis
the interval (a_1, a_m) is less than the interval (a_{m+1}, a_2m). It remains to show
that every subsystem of 2m−1 elements is in F. Two cases naturally arise.

CASE 1. The element omitted in the subsystem is a_1, a_m, a_{m+1} or a_2m.
Then the nine permutations of (2) are not in D restricted to the subsystem,
and the subsystem is not merely imbeddable in <Re, Δ>, but by virtue of (1)
is a subsystem of it.

CASE 2. The element omitted is neither a_1, a_m, a_{m+1} nor a_2m. Let a_i
be the element not in the subsystem. There are two cases to consider.

CASE 2a. a_i < a_m. For this situation we may use for our numerical
assignment the function f defined by f(a_{i−j}) = a_{i−j} + 1 for j = 1, ..., i−1,
f(a_{i+j}) = a_{i+j} for j = 1, ..., 2m−i. It is straightforward but tedious to
verify that f is a numerical assignment, that is, that it preserves the relation
D as defined by (1) and (2). Only two observations are crucial to
this verification. First, regarding atomic intervals (in the full system), if
a_{i−j+1} − a_{i−j} = a_{k+1} − a_k for k > i, then f(a_{i−j+1}) − f(a_{i−j}) = (a_{i−j+1} + 1) −
(a_{i−j} + 1) = a_{k+1} − a_k = f(a_{k+1}) − f(a_k). Second, the numbers in A were so
chosen that, if x, y, z, w ∈ A, and (z, w) is not an atomic interval, and
(x, y) ≠ (z, w) and x − y < z − w, then x − y + 2 < z − w. Then it is clear
from the definition of f that f(x) − f(y) < f(z) − f(w). (Note that the
above implies the weaker result that no two distinct nonatomic intervals
have the same size.)

CASE 2b. a_i > a_{m+1}. Here we may use a numerical assignment f
defined, as would be expected from the previous case, by f(a_{i−j}) = a_{i−j}
for j = 1, ..., i−1, f(a_{i+j}) = a_{i+j} + 1 for j = 1, ..., 2m−i. This completes
the proof of the theorem.


It would be pleasant to report that we could prove a stronger result about
the theory of measurement … finitely axiomatizable.
Unfortunately, … of tools available for studying
such questions …ems. However, we would like
to state a conjecture which if true would provide one useful tool for studying
the finite axiomatizability of finitary theories …

… from Tarski's results [7] on universal (arithmetical) classes in the wider
sense that, if the finitistic restrictions are removed throughout in the 
conjecture, the thus modified conjecture is true; for the class of relational 
systems satisfying S, being closed under submodels, is a universal class in 
the wider sense and is axiomatizable by a denumerable set of universal 
sentences. Since S is logically equivalent to this set of universal sentences,
it is a logical consequence of some finite subset of them; but because it 
implies the full set, it also implies the finite subset and is thus equivalent 
to it. 

Our conjecture is one concerning the general theory of models and 
its pertinence is not restricted to theories of measurement. In conclusion 
we should like to mention an unsolved problem typical of those which arise 
in the special area of measurement. Let R be any binary numerical relation 
definable in an elementary manner in terms of plus and less than. Is the finitary 
theory of measurement of all systems imbeddable in R finitely axiomatizable? 
(If our conjecture about finite models is true, then the theory of measurement
F is not finitely axiomatizable and shows that the answer to this
problem is negative for quaternary relations definable in terms of plus and
less than.)
models for choice-reaction time. The working details are confined to appendices
and only definitions and results appear in the text. It is hoped that
this method of presentation will assist the reader in making a quick
"calculated-observed" analysis of the data he may have. The choice


to our interests; (iii) for 
need to be designed with 


Y stimulus or signal and is Ce 
nal and make an appropriate reaction. ৰ 
eaction is made. S is presented with sign 
ttributes form a random sequence; that I 
for a given run of signals, the attributes of different signals are mutually
independent and their probabilities of presentation do not change with …
The models assume that S has a settled mode of response. They will be hydi


identify some attribute of the Sig 


mechanism or “computer.” A 


a response. The time taken for the response to be recorded will be called the
motor time. Thus the choice …
the input time, T_i; the decision time, T_d; the motor time, T_m. The models
apply to T_d, which will be related to the environmental variables (the number
of signals and their frequencies of presentation) and the rate at which S
makes incorrect responses. By concentrating on T_d in this way, it is not implied
that T_i and T_m are necessarily independent of these factors.


Likelihood Ratio Models for the Two-Choice Situation


It is assumed that the subject knows when the signal (either $s_0$ or $s_1$, say) commences; that is, he knows when to start examining the stream of information arriving at the computer. (This stream is “noisy” until the stream from the signal is added to it.) This assumption holds in the self-paced condition and also when some preparatory warning signal is given. It is supposed that there is some overlap in the information; that is, some patterns of information may arise from either $s_0$ or $s_1$. If there is no uncertainty in this sense, there is no need for a statistical computer. The uncertainty may arise from the external situation, from noise added at the input stage, or from both sources. We will suppose that the information on which S’s computer operates is equivalent to a series of independent random variables at short time intervals $t$ and that each random variable has the (stationary) distribution of a random variable $x$ (dependent on which signal has occurred) until the response is made.


[Figure: the signal, sampled at successive time intervals $t$.]

Let $p_0(x)$ and $p_1(x)$ be the probabilities of $x$ when the signal is $s_0$ and $s_1$, respectively. If the $x$’s are instantaneous samples of an almost continuous stream of information, then the assumption of independence implies zero auto-correlation between parts of the stream not less than time $t$ apart. If the $x$’s are integrals of the stream over the successive intervals, then the assumption requires zero auto-correlation for all time lags (or at least for those not small compared with $t$). Suppose the computer transforms each $x$ to a quantity $c(x)$ which is then stored in an adder.
Sequential Case

The computer makes a running total of $c(x_1), c(x_2), \ldots$. Constants $\log A$ and $\log B$ with $A > B$ are preselected so that S decides for $s_0$ (and makes the appropriate motor action) as soon as the total falls below $\log B$, provided the total has not previously exceeded $\log A$, when the decision would have been made for $s_1$. (The odd way of expressing the constants facilitates later references.) If the decision is made at the $n$th sample, $T_d = nt$. The theory of the sequential probability ratio test [1] shows that the optimum choice of the function $c(x)$ is

(1)  $c(x) = \log p_1(x) - \log p_0(x).$
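The sequential rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the thresholds use the approximations $A = (1 - \beta)/\alpha$ and $B = \beta/(1 - \alpha)$ quoted in Appendix 1, and the normal distributions in the example are chosen for convenience rather than taken from the text.

```python
import math

def wald_thresholds(alpha, beta):
    """Approximate log A and log B from the two error probabilities.

    alpha: probability of an incorrect response to s0.
    beta:  probability of an incorrect response to s1.
    """
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    return log_a, log_b

def sequential_decision(samples, c, log_a, log_b):
    """Accumulate c(x) over the samples and return (decision, n).

    decision is 1 when the running total first exceeds log A,
    0 when it first falls below log B, and None if the samples
    run out before either boundary is crossed.
    """
    total = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        total += c(x)
        if total > log_a:
            return 1, n
        if total < log_b:
            return 0, n
    return None, n

# Hypothetical example: p0 is N(0, 1) and p1 is N(d, 1), for which
# c(x) = log p1(x) - log p0(x) = d*x - d*d/2.
d = 1.0
c = lambda x: d * x - d * d / 2.0
log_a, log_b = wald_thresholds(alpha=0.1, beta=0.1)
decision, n = sequential_decision([1.5, 1.5, 1.5], c, log_a, log_b)
```

The decision time of the model is then the returned sample count multiplied by the sampling interval.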


Such a function implies that S is familiar with the probability distributions $p_0(x)$ and $p_1(x)$. Such familiarity may be the result of a process of learning, provided S has performed many trials of the discrimination task and been given knowledge of results. S’s computer may be thought of as exploratory, trying out different $c(x)$’s until the optimal one is found. However, it is conceivable that the distributions can be deduced by S from the structure of the situation and then imposed on his computer. The optimality of (1) is stated by Wald [1] in the following terms: let $\bar n_0$, $\bar n_1$ be the averages of the number of samples necessary for decision when the signals presented are $s_0$, $s_1$, respectively. If $\bar n_0^*$, $\bar n_1^*$ are the averages for any other decision procedure based on $x_1$, $x_2$, etc., with equal probabilities of incorrect response to $s_0$ and $s_1$, then $\bar n_0^* \ge \bar n_0$ and $\bar n_1^* \ge \bar n_1$. It is possible that this form of optimality does not appeal to S, who may have to be trained to use it by suitable reward.
Before testing the model, it must be remembered that it is $T$ which is measured and not $T_d$. Even so, a test is available which requires only the following assumption. Consider trials leading to a decision for $s_0$. The assumption is, given the value of $T_d$, that the distribution of $T_i + T_m$ is the same whether the decision is right or wrong. (The same assumption is made for decisions for $s_1$.) This does not exclude the possibility that $T_i + T_m$ and $T_d$ be correlated. The length of time, [...] information presented to the computer [...] Alternatively, if $T_d$ is long, [...]

[...] With the above assumption, this implies that the same result should hold for a comparison of the correct [and incorrect responses] leading to $s_0$ (and for a comparison of those leading to $s_1$). This provides the basis of a reasonable test of the model. However, a fair proportion of errors would be needed to give a powerful test.

Without making assumptions about $p_0(x)$ and $p_1(x)$, it is difficult to think of more ways of examining the validity of the model. Since $x$ is an intervening variable without operational definition, it would clearly be unwise to assume much about $p_0(x)$ and $p_1(x)$. However, there is one assumption, called the “condition of symmetry,” which in some discrimination tasks may be reasonable. This is that the distribution of $p_1(x)/p_0(x)$, when $x$ is distributed according to $p_0(x)$, is identical with that of $p_0(x)/p_1(x)$, when $x$ is distributed according to $p_1(x)$. It is shown in Appendix 2 that, if this condition holds,


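(2)  $\bar n_1/\bar n_0 = J(\beta, \alpha)/J(\alpha, \beta);$

(3)  $J(\alpha, \beta)v_1 - J(\beta, \alpha)v_0 = \{J(\beta, \alpha)\,\alpha(1 - \alpha)[4\bar n_1^2 - (\bar n_1 - \bar n_0)^2] - J(\alpha, \beta)\,\beta(1 - \beta)[4\bar n_0^2 - (\bar n_0 - \bar n_1)^2]\}/(1 - \alpha - \beta)^2,$

where $\alpha$ and $\beta$ are the probabilities of incorrect response to a single $s_0$ and $s_1$, respectively, $v_i$ is the variance of the sample sizes when $s_i$ is presented, and

$J(\alpha, \beta) = \alpha \log[\alpha/(1 - \beta)] + (1 - \alpha)\log[(1 - \alpha)/\beta].$

If it is feasible to estimate $T_d$ directly for each trial by eliminating $T_i + T_m$ from $T$, then (2) and (3) imply

(4)  $\bar T_{d1}/\bar T_{d0} = J(\beta, \alpha)/J(\alpha, \beta),$

(5)  $J(\alpha, \beta)\,\mathrm{var}\,T_{d1} - J(\beta, \alpha)\,\mathrm{var}\,T_{d0} = \{J(\beta, \alpha)\,\alpha(1 - \alpha)[4\bar T_{d1}^2 - (\bar T_{d1} - \bar T_{d0})^2] - J(\alpha, \beta)\,\beta(1 - \beta)[4\bar T_{d0}^2 - (\bar T_{d0} - \bar T_{d1})^2]\}/(1 - \alpha - \beta)^2.$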


Equations (4) and (5) are most relevant if S can be persuaded to achieve different $(\alpha, \beta)$ combinations without changing the distributions $p_0(x)$ and $p_1(x)$. When $\alpha = \beta$, then $\bar n_0 = \bar n_1$ and $v_0 = v_1$; with the assumptions that $T_i + T_m$ is (i) uncorrelated with $T_d$ and (ii) independent of the signal presented, this implies equality of means and variances of reaction times to the signals. So, for the latter special case, it is not necessary to measure $T_d$.

For the “condition of symmetry” it is sufficient that, with $x$ represented as a number, $p_0(x) = p_1(x - d)$ for some number $d$ with $p_0(x)$ symmetrical about its mean. This might occur when $s_0$, $s_1$ are signals which are close together on some scale and the error added to the signals to make $x$ has the same distribution for each signal. Symmetry would not be expected in absolute threshold discriminations or in the discrimination of widely different colors in a color-noisy background. Another sufficient condition is that $x$ be bivariate, $[x(1), x(2)]$, the probabilities under $s_0$ obtained from those under $s_1$ by interchanging $x(1)$ and $x(2)$. For instance, $x(1)$ and $x(2)$ may be the inputs on two noisy channels and $s_0$ consists of stimulation of the first while $s_1$ consists of stimulation of the second.

A further prediction of the model for the symmetrical case can be made when S is persuaded by a suitable reward to give equal weight to errors to $s_0$ and $s_1$, that is, to minimize his unconditional error probability, by adjustment of the constants $A$ and $B$ in his computer. If $p_0$ is the frequency of presentation of $s_0$ then the error probability is $p_0\alpha + (1 - p_0)\beta$ or $e$, say, and the average decision time is $p_0\bar T_{d0} + (1 - p_0)\bar T_{d1}$ or $\bar T_d$, say. It is shown in Appendix 3 that, provided $10e < p_0 < 1 - 10e$, the minimization results in the following relation between $\bar T_d$, $e$ and $p_0$:

$\bar T_d \propto J(e, e) - J(p_0, p_0).$
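The relation between the mean decision time, $e$ and $p_0$ can be checked numerically. The sketch below is an illustration, not the derivation of Appendix 3: it assumes that the minimizing thresholds give both decisions the same posterior probability of error $e$ (an assumption of this example), computes the implied $\alpha$ and $\beta$, and compares $p_0 J(\alpha, \beta) + (1 - p_0)J(\beta, \alpha)$ with $J(e, e) - J(p_0, p_0)$. The function names are hypothetical.

```python
import math

def J(a, b):
    """J(a, b) = a log[a/(1-b)] + (1-a) log[(1-a)/b], as defined in the text."""
    return a * math.log(a / (1.0 - b)) + (1.0 - a) * math.log((1.0 - a) / b)

def error_rates(e, p0):
    """Return (alpha, beta) when each decision is wrong with probability e.

    Assumption of this sketch: the optimal thresholds make
    P(wrong | decided s0) = P(wrong | decided s1) = e.
    """
    q1 = (1.0 - p0 - e) / (1.0 - 2.0 * e)  # probability of deciding for s1
    q0 = 1.0 - q1                           # probability of deciding for s0
    alpha = e * q1 / p0                     # P(decide s1 | s0 presented)
    beta = e * q0 / (1.0 - p0)              # P(decide s0 | s1 presented)
    return alpha, beta

def mean_time_factor(e, p0):
    """p0*J(alpha, beta) + (1 - p0)*J(beta, alpha), proportional to mean decision time."""
    alpha, beta = error_rates(e, p0)
    return p0 * J(alpha, beta) + (1.0 - p0) * J(beta, alpha)

e, p0 = 0.02, 0.3  # satisfies 10e < p0 < 1 - 10e
lhs = mean_time_factor(e, p0)
rhs = J(e, e) - J(p0, p0)
```

Under the stated assumption the two quantities agree to rounding error, which is consistent with the proportionality given above.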
The Non-Sequential Fixed-Sample Case


If S has an incentive to react quickly and correctly, then the advantage of the sequential decision procedure is that those discriminations which by chance happen to be easy are made quickly and time is saved. However, it is possible that S may adopt a different, less efficient strategy, which is to fix $T_d$ for all trials at a value which will give a certain accepted error rate. Let the sample size corresponding to this decision time be $n$. The likelihood ratio procedures are as follows: decide for $s_0$ if $c(x_1) + \cdots + c(x_n) < \log C$; decide for $s_1$ if $c(x_1) + \cdots + c(x_n) > \log C$; $c(x) = \log p_1(x) - \log p_0(x)$ and $C > 0$. These procedures are optimal in the sense that, if any other procedure based on $x_1, \ldots, x_n$ is used, there exists one of the likelihood ratio procedures with smaller error probabilities. It was remarkable that in the sequential case useful predictions were obtainable under mild restrictions on $p_0(x)$ and $p_1(x)$. Unfortunately this does not hold for the fixed-sample case, making more difficult the problem of testing whether such a model holds.

If there is no input storage, it is possible that the results of the self-imposed strategy just outlined are equivalent to those obtainable when the experimenter himself cuts off the signals after an exposure time $T_d$. But this is the type of situation considered by Peterson and Birdsall [2]. The emphasis of these authors is mainly on the external parameters (such as energy) rather than on any supposed intervening variable. They define a set of physical situations for auditory discrimination in terms of a parameter $d$, which is equivalent to the difference between the means of two normal populations with unit variance. (For, in the cases considered, it happens that the logarithm of the likelihood ratio of the actual physical random [...] is normally distributed with equality of [...].) This parameter sets a limit to the various [...] and $s_1$) of any discriminator using the [...]

More than Two Alternatives

For $m$ alternatives there are $m$ probability distributions for the intervening variable $x$ (which may be multivariate); that is, signal $s_i$ induces an $x$ with the probability distribution $p_i(x)$ for $i = 1, \ldots, m$. We will consider the consequences of a fixed-sample decision procedure based on $x_1, \ldots, x_n$, where $n$ is fixed.

If the signals are presented independently with probabilities $p_1, \ldots, p_m$ (adding to unity) and if $\alpha_i(D)$ is the probability of error to signal $s_i$ when the decision procedure $D$ (based on $x_1, \ldots, x_n$) is used, then the probability of error to a single presentation is

$e = \sum_{i=1}^{m} p_i\,\alpha_i(D).$

It is shown in Appendix 4 that the $D$ minimizing $e$ is that which effectively selects the signal with maximum posterior probability. In this section, this minimum $e$ will be related to $n$ (or $T_d/t$) and $m$ when distributions are normal. However, in the validation of the model it might be necessary to supplement $T_d$ with a further time, representing the time the computer requires to examine the $m$ posterior probabilities to decide which is the largest. For, although it might be reasonable to suppose that $T_i + T_m$ is independent of $m$, one would expect this further time to vary with $m$. The simplest model would be to suppose that it equals $(m - 1)$ times the time necessary to compare any two of the probabilities and decide which is the larger.

We will state the relation between $n$ and $m$ when $e$ is constant in the following special case (treated by Peterson and Birdsall [3], who stated the relation between $e$ and $m$ when $n$ is held constant by the experimenter): we take $p_1 = p_2 = \cdots = p_m = 1/m$ and $x$ a multivariate random variable $[x(1), \ldots, x(m)]$. Under $s_i$, suppose that $x(1), \ldots, x(m)$ are independent and that $x(i)$ is normally distributed with mean $\mu > 0$ and unit variance, while the other components of $x$ are normal with zero means and unit variances. Thus there is all-round symmetry. $x(1), \ldots, x(m)$ can be regarded as the inputs on $m$ similar channels. The $i$th channel is stimulated under $s_i$. It is readily seen that the optimal procedure is to choose the signal corresponding to the channel with the largest total [...]. It is shown in Appendix 5 that, with this procedure,

$n\mu^2 = \{1 + [0.64(m - 1)^{-1/2} + 0.45]^2\}\,[\Phi^{-1}(1 - e) - \Phi^{-1}(1/m)]^2$

for those $m$ for which $e < 1 - 1/m$ [...], where $\Phi^{-1}$ is the inverse of the normal standardized distribution function. The values of $n\mu^2$ for certain values of $e$ and $m$ have been calculated. If $\mu$ is independent of $m$, then $T_d$ is proportional to $n\mu^2$ and the results are plotted in Figure 1. It can be seen that $T_d$ is very nearly linear against $\log m$, which agrees with some experimental findings in this field.

The question may be raised whether any $m$-choice task can obey the symmetry condition of the model. Peterson and Birdsall apply the model to the case where an auditory signal is presented in one of four equal periods [...]
FIGURE 1

The Decision Time ($T_d$) for Error Rate ($e$) and Number of Equally Likely Alternatives ($m$)

[...] superimposed so that response [...] may not be important and there may be symmetry.


Appendix 1

Let $n_{ij}$ be the sample size for a decision in favor of $s_i$ when $s_j$ is presented. The distribution of $n_{ij}$ is completely determined by its characteristic function, $\psi_{ij}$. From A5.1 of [1], if

$\phi_i(t) = \sum p_i(x)\,[p_1(x)/p_0(x)]^t,$

then

(6)  $(1 - \alpha)B^t\,\psi_{00}[-\log \phi_0(t)] + \alpha A^t\,\psi_{10}[-\log \phi_0(t)] = 1,$

(7)  $\beta B^t\,\psi_{01}[-\log \phi_1(t)] + (1 - \beta)A^t\,\psi_{11}[-\log \phi_1(t)] = 1,$

[provided $E$ and] $V$ defined in Appendix 2 are small. If $\alpha < 0.1$ and $\beta < 0.1$ then to a good approximation $A = (1 - \beta)/\alpha$ and $B = \beta/(1 - \alpha)$. [Now $\phi_0(1 + u)$] $= \phi_1(u)$; so, putting $t = 1 + u$ in (6) and, correspondingly, $t = u - 1$ in (7),

$\beta B^u\,\psi_{00}[-\log \phi_1(u)] + (1 - \beta)A^u\,\psi_{10}[-\log \phi_1(u)] = 1,$

$(1 - \alpha)B^u\,\psi_{01}[-\log \phi_0(u)] + \alpha A^u\,\psi_{11}[-\log \phi_0(u)] = 1.$

By comparing these equations with (6) and (7), it is found that $\psi_{00} = \psi_{01}$ and $\psi_{10} = \psi_{11}$. Therefore the distributions of $n_{00}$ and $n_{01}$ (and similarly those of $n_{10}$ and $n_{11}$) are identical.
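Appendix 2

In the case of symmetry,

$\sum p_0(x)\log[p_0(x)/p_1(x)] = \sum p_1(x)\log[p_1(x)/p_0(x)] = E,$

and

$\mathrm{var}\,\log[p_0(x)/p_1(x)]$ under $p_0(x)$ $= \mathrm{var}\,\log[p_1(x)/p_0(x)]$ under $p_1(x)$ $= V.$

From A:72 of [1], if $E$ and $V$ are small,

(8)  $\bar n_0 = J(\alpha, \beta)/E; \quad \bar n_1 = J(\beta, \alpha)/E.$

Therefore

$\bar n_1/\bar n_0 = J(\beta, \alpha)/J(\alpha, \beta).$

By differentiating (6) twice with respect to $t$ and substituting $t = 0$, using (8) and the fact that $\psi_{ij}$ is the characteristic function of $n_{ij}$,

$v_0 = \dfrac{V J(\alpha, \beta)}{E^3} - \dfrac{\alpha(1 - \alpha)[4\bar n_1^2 - (\bar n_1 - \bar n_0)^2]}{(1 - \alpha - \beta)^2}.$

By symmetry

$v_1 = \dfrac{V J(\beta, \alpha)}{E^3} - \dfrac{\beta(1 - \beta)[4\bar n_0^2 - (\bar n_0 - \bar n_1)^2]}{(1 - \alpha - \beta)^2}.$

Hence

$J(\alpha, \beta)v_1 - J(\beta, \alpha)v_0 = \{J(\beta, \alpha)\,\alpha(1 - \alpha)[4\bar n_1^2 - (\bar n_1 - \bar n_0)^2] - J(\alpha, \beta)\,\beta(1 - \beta)[4\bar n_0^2 - (\bar n_0 - \bar n_1)^2]\}/(1 - \alpha - \beta)^2.$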
Appendix 3

If $\alpha < 0.1$ and $\beta < 0.1$ then, by (8), $\bar T_d \propto p_0 J(\alpha, \beta) + (1 - p_0)J(\beta, \alpha)$. Keeping $e$ [or $p_0\alpha + (1 - p_0)\beta$] constant at a value in the range given by $10e < p_0 < 1 - 10e$, the condition on $\alpha$ and $\beta$ will be satisfied. It is found by the usual methods that the minimum $\bar T_d$ is proportional to $J(e, e) - J(p_0, p_0)$.

Appendix 4


Let $X$ be the set of all possible values of $x = (x_1, \ldots, x_n)$ and $X_i$ the set of $x$ for which a decision is made for $s_i$. Then

$e = \sum_{i} \sum_{x \in X - X_i} p_i\,p_i(x).$

Suppose $X_i$ and $X_j$ have a common boundary; then, for $e$ to be a minimum, it will not be changed by small displacements in this boundary. Hence, on the boundary, $p_i p_i(x) = p_j p_j(x)$; that is, the posterior probability of $s_i$ equals that of $s_j$. Considering all possible boundaries, the solution is that $X_i$ is the set of $x$’s for which $s_i$ has greater posterior probability than the other signals.


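Appendix 5

Write

$\xi(i) = \dfrac{1}{n}\sum_{j=1}^{n} x_j(i).$

Then, under $s_1$, $\sqrt{n}\,\xi(1)$ is $N(\sqrt{n}\mu, 1)$ and $\sqrt{n}\,\xi(i)$ is $N(0, 1)$ for $i \ne 1$. Therefore,

$\alpha_1(D) = \cdots = \alpha_m(D) = 1 - (2\pi)^{-1/2}\int [\Phi(u)]^{m-1}\exp[-\tfrac{1}{2}(u - \sqrt{n}\mu)^2]\,du.$

On integration by parts,

(9)  $e = \sum p_i\,\alpha_i(D) = (m - 1)(2\pi)^{-1/2}\int [\Phi(u)]^{m-2}\,\Phi(u - \sqrt{n}\mu)\exp(-\tfrac{1}{2}u^2)\,du = e_m(\theta),$

say, where $\theta = \sqrt{n}\mu$. [...] their tabulation. However, [...] $e_m(\theta)$ [...] is a “probability density function” for $\theta$, and hence the distribution of $\theta$ turns out to be [that of $w - v$, where $w = \max(v_1, \ldots, v_{m-1})$ and $v, v_1, \ldots, v_{m-1}$ are independent standardized] normal variables. [...] tabulations of Peterson and Birdsall. If $\theta$ is $N(\nu, \sigma^2)$, we [proceed as] follows. From (9), $e_m(0) = 1 - (1/m)$. Also $e_m(0) = 1 - \Phi(-\nu/\sigma)$. Therefore

$\nu/\sigma = -\Phi^{-1}(1/m).$

Also $\sigma^2 = \mathrm{var}\,v + \mathrm{var}\,w$ and from Graph 4.2.2(6) of [4], $\mathrm{var}\,w = [0.64(m - 1)^{-1/2} + 0.45]^2$ for $m < 20$, which determines $\sigma^2$. Putting $e_m(\theta) = e$, the constant error rate,

$n\mu^2 = \{1 + [0.64(m - 1)^{-1/2} + 0.45]^2\}\,[\Phi^{-1}(1 - e) - \Phi^{-1}(1/m)]^2.$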
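The closing formula of Appendix 5 is easy to evaluate numerically. The sketch below is an illustration only: it reads the formula as $n\mu^2 = \{1 + [0.64(m - 1)^{-1/2} + 0.45]^2\}[\Phi^{-1}(1 - e) - \Phi^{-1}(1/m)]^2$, the function name is hypothetical, and the approximation for $\mathrm{var}\,w$ is stated in the text only for $m < 20$.

```python
from statistics import NormalDist

def n_mu_squared(e, m):
    """Approximate n * mu**2 needed for error rate e with m equally likely signals.

    Valid only when 0 < e < 1 - 1/m; the variance approximation for the
    largest of m - 1 standard normal variables is quoted for m < 20.
    """
    if m < 2 or not (0.0 < e < 1.0 - 1.0 / m):
        raise ValueError("require m >= 2 and 0 < e < 1 - 1/m")
    inv = NormalDist().inv_cdf                      # inverse standard normal cdf
    var_w = (0.64 * (m - 1) ** -0.5 + 0.45) ** 2    # approximation from the text
    sigma2 = 1.0 + var_w                            # var v + var w, with var v = 1
    return sigma2 * (inv(1.0 - e) - inv(1.0 / m)) ** 2

# Values for a fixed error rate and an increasing number of alternatives.
values = [n_mu_squared(0.1, m) for m in (2, 4, 8, 16)]
```

Plotting such values against $\log m$ is one way to inspect the near-linearity described in the main text.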
PART II


LEARNING AND STOCHASTIC PROCESSES


STATISTICAL INFERENCE ABOUT MARKOV CHAINS
T. W. ANDERSON AND LEO A. GOODMAN


Columbia University and University of Chicago 

Summary. Maximum likelihood estimates and their asymptotic distribution are obtained for the transition probabilities in a Markov chain of arbitrary order when there are repeated observations of the chain. Likelihood ratio tests and $\chi^2$-tests of the form used in contingency tables are obtained for testing the following hypotheses: (a) that the transition probabilities of a first order chain are constant, (b) that in case the transition probabilities are constant, they are specified numbers, and (c) that the process is a $u$th order Markov chain against the alternative it is $r$th but not $u$th order. In case $u = 0$ and $r = 1$, case (c) results in tests of the null hypothesis that observations at successive time points are statistically independent against the alternate hypothesis that observations are from a first order Markov chain. Tests of several other hypotheses are also considered. The statistical analysis in the case of a single observation of a long chain is also discussed. There is some discussion of the relation between likelihood ratio criteria and $\chi^2$-tests of the form used in contingency tables.


1. Introduction. A Markov chain is sometimes a suitable probability model for certain time series in which the observation at a given time is the category into which an individual falls. The simplest Markov chain is that in which there are a finite number of states or categories and a finite number of equidistant time points at which observations are made, the chain is of first-order, and the transition probabilities are the same for each time interval. Such a chain is described by the initial state and the set of transition probabilities; namely, the conditional probability of going into each state, given the immediately preceding state. We shall consider methods of statistical inference for this model when there are many observations in each of the initial states and the same set of transition probabilities operate. For example, one may wish to estimate the transition probabilities or test hypotheses about them. We develop an asymptotic theory for these methods of inference when the number of observations increases. We shall also consider methods of inference for more general models, for example, where the transition probabilities need not be the same for each time interval.

An illustration of the use of some of the statistical methods described herein has been given in detail [2]. The data for this illustration came from a “panel study” on vote intention. Preceding the 1940 presidential election each of a number of potential voters was asked his party or candidate preference each month from May to October (6 interviews). At each interview each person was
classified as Republican, Democrat, or “Don’t Know,” the latter being a residual 
category consisting primarily of people who had not decided on a party or 
candidate. One of the null hypotheses in the study was that the probability of 
a voter's intention at one interview depended only on his intention at the im- 
mediately preceding interview (first-order case), that such a probability was 
constant over time (stationarity), and that the same probabilities hold for all 
individuals. It was of interest to see how the data conformed to this null hy- 
pothesis, and also in what specific ways the data differed from this hypothesis. 
This present paper develops and extends the theory and the methods given 
in [1] and [2]. It also presents some newer methods, which were first mentioned 
in [9], that are somewhat different from those given in [1] and [2], and explains 
how to use both the old and new methods for dealing with more general hy- 
potheses. Some corrections of formulas appearing in [1] and [2] are also given 
in the present paper. An advantage of some of the new methods presented 
herein is that, for many users of these methods, their motivation and their 
application seem to be simpler. 
The problem of the estimation of the transition probabilities, and of the testing of goodness of fit and the order of the chain has been studied by Bartlett [8] and Hoel [10] in the situation where only a single sequence of states is observed; they consider the asymptotic theory as the number of time points increases. We shall discuss this situation in Section 5 of the present paper, where a $\chi^2$-test of the form used in contingency tables is given for a hypothesis that is a generalization of a hypothesis that was considered from the likelihood ratio point of view by Hoel [10].

[...] likelihood ratio criteria and $\chi^2$-tests, [...] related to some ordinary contingency table procedures. A discussion of the relation between likelihood ratio tests and $\chi^2$-tests appears in the final section.

For further discussion of Markov chains, the reader is referred to [2] or [7].


2. Estimation of the parameters of a first-order Markov chain.

2.1. The model. Let the states be $i = 1, 2, \ldots, m$. Though the state $i$ is [...] from 1 to $m$, no actual use is made of [...] be, for example, a political party, [...], etc. Let the times of observation [be $t = 0, 1, \ldots, T$]. Let $p_{ij}(t)$ $(i, j = 1, \ldots, m;\ t = 1, \ldots, T)$ be the probability of state $j$ at time $t$, given state $i$ at time $t - 1$. We shall deal both with (a) stationary transition probabilities (that is, $p_{ij}(t) = p_{ij}$ for $t = 1, \ldots, T$) and with (b) nonstationary transition probabilities (that is, where the transition probabilities need not be the same for each time interval). We assume in this [...] individual consists of the sequence of states the individual is in at $t = 0, 1, \ldots, T$, namely $i(0), i(1), i(2), \ldots, i(T)$. Given the initial state $i(0)$, there are $m^T$ possible sequences. These represent mutually exclusive events with probabilities

(2.1)  $p_{i(0)i(1)}\,p_{i(1)i(2)} \cdots p_{i(T-1)i(T)}$

when the transition probabilities are stationary. (When the transition probabilities are not necessarily stationary, symbols of the form $p_{i(t-1)i(t)}$ should be replaced by $p_{i(t-1)i(t)}(t)$ throughout.)

Let $n_{ij}(t)$ denote the number of individuals in state $i$ at $t - 1$ and $j$ at $t$. We shall show that the set of $n_{ij}(t)$ $(i, j = 1, \ldots, m;\ t = 1, \ldots, T)$, a set of $m^2T$ numbers, form a set of sufficient statistics for the observed sequences. Let $n_{i(0)i(1)\cdots i(T)}$ be the number of individuals whose sequence of states is $i(0), i(1), \ldots, i(T)$. Then

(2.2)  $n_{gj}(t) = \sum n_{i(0)i(1)\cdots i(T)},$

where the sum is over all values of the $i$’s with $i(t - 1) = g$ and $i(t) = j$. The probability, in the $nT$ dimensional space describing all sequences for all $n$ individuals (for each initial state there are $nT$ dimensions), of a given ordered set of sequences for the $n$ individuals is

(2.3)  $\prod [p_{i(0)i(1)}(1)\,p_{i(1)i(2)}(2) \cdots p_{i(T-1)i(T)}(T)]^{n_{i(0)i(1)\cdots i(T)}}$

$= \Bigl(\prod p_{i(0)i(1)}(1)^{n_{i(0)i(1)\cdots i(T)}}\Bigr) \cdots \Bigl(\prod p_{i(T-1)i(T)}(T)^{n_{i(0)i(1)\cdots i(T)}}\Bigr)$

$= \prod_{t=1}^{T} \prod_{g,j} p_{gj}(t)^{n_{gj}(t)},$

where the products in the first two lines are over all values of the $T + 1$ indices. Thus, the set of numbers $n_{ij}(t)$ form a set of sufficient statistics, as announced.
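The reduction of the observed sequences to the counts $n_{ij}(t)$ can be illustrated with a short Python sketch. The function name and the toy data are hypothetical, and states are coded $0, \ldots, m - 1$ here rather than $1, \ldots, m$ as in the text.

```python
def transition_counts(sequences, m, T):
    """Return counts with counts[t-1][i][j] = n_ij(t).

    sequences: one list of T + 1 states per individual, observed at
    t = 0, 1, ..., T, with states coded 0, ..., m - 1.
    """
    counts = [[[0] * m for _ in range(m)] for _ in range(T)]
    for seq in sequences:
        for t in range(1, T + 1):
            # Individual was in state seq[t-1] at time t-1 and seq[t] at time t.
            counts[t - 1][seq[t - 1]][seq[t]] += 1
    return counts

# Three hypothetical individuals, two states, observed at t = 0, 1, 2.
sequences = [[0, 1, 1], [0, 0, 1], [1, 1, 0]]
counts = transition_counts(sequences, m=2, T=2)
```

Any two sets of sequences that yield the same tables of counts have the same probability under (2.3), which is the sense in which the counts are sufficient.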
The actual distribution of the $n_{ij}(t)$ is (2.3) multiplied by an appropriate function of factorials. Let $n_i(t - 1) = \sum_{j=1}^{m} n_{ij}(t)$. Then the conditional distribution of $n_{ij}(t)$, $j = 1, \ldots, m$, given $n_i(t - 1)$ (or given $n_k(s)$, $k = 1, \ldots, m$; $s = 0, \ldots, t - 1$) is

(2.4)  $\dfrac{n_i(t - 1)!}{\prod_{j=1}^{m} n_{ij}(t)!}\,\prod_{j=1}^{m} p_{ij}(t)^{n_{ij}(t)}.$

This is the same distribution as one would obtain if one had $n_i(t - 1)$ observations on a multinomial distribution with probabilities $p_{ij}(t)$ and with resulting numbers $n_{ij}(t)$. The distribution of the $n_{ij}(t)$ (conditional on the $n_i(0)$) is

(2.5)  $\prod_{t=1}^{T} \Bigl\{ \prod_{i=1}^{m} \Bigl[ \dfrac{n_i(t - 1)!}{\prod_{j=1}^{m} n_{ij}(t)!}\,\prod_{j=1}^{m} p_{ij}(t)^{n_{ij}(t)} \Bigr] \Bigr\}.$
For a Markov chain with stationary transition probabilities, a stronger result concerning sufficiency follows from (2.3); namely, the set $n_{ij} = \sum_{t=1}^{T} n_{ij}(t)$ form a set of sufficient statistics. This follows from the fact that, when the transition probabilities are stationary, the probability (2.3) can be written in the form

(2.6)  $\prod_{t=1}^{T} \prod_{i,j} p_{ij}^{\,n_{ij}(t)} = \prod_{i,j} p_{ij}^{\,n_{ij}}.$

For not necessarily stationary transition probabilities $p_{ij}(t)$, the $n_{ij}(t)$ are a minimal set of sufficient statistics.


2.2. Maximum likelihood estimates. The stationary transition probabilities $p_{ij}$ can be estimated by maximizing the probability (2.6) with respect to the $p_{ij}$, subject of course to the restrictions $p_{ij} \ge 0$ and

(2.7)  $\sum_{j=1}^{m} p_{ij} = 1, \quad i = 1, 2, \ldots, m,$

when the $n_{ij}$ are the actual observations. This probability is precisely of the same form, except for a factor that does not depend on $p_{ij}$, as that obtained for $m$ independent samples, where the $i$th sample $(i = 1, 2, \ldots, m)$ consists of $n_i^* = \sum_j n_{ij}$ multinomial trials with probabilities $p_{ij}$ $(i, j = 1, 2, \ldots, m)$. For such samples, it is well-known and easily verified that the maximum likelihood estimates for $p_{ij}$ are

(2.8)  $\hat p_{ij} = n_{ij}/n_i^* = \sum_{t=1}^{T} n_{ij}(t) \Big/ \sum_{k=1}^{m} \sum_{t=1}^{T} n_{ik}(t) = \sum_{t=1}^{T} n_{ij}(t) \Big/ \sum_{t=0}^{T-1} n_i(t),$

[...] parameter-free factors, and the restrictions on the $p_{ij}$ are the same. In particular, it applies to the estimation of the parameters $p_{ij}$ in (2.6).

When the transition probabilities are not necessarily stationary, the general approach used in the preceding paragraph can still be applied, and the maximum likelihood estimates for the $p_{ij}(t)$ are found to be

(2.9)  $\hat p_{ij}(t) = n_{ij}(t)/n_i(t - 1) = n_{ij}(t) \Big/ \sum_{k=1}^{m} n_{ik}(t).$

The same maximum likelihood estimates for the $p_{ij}(t)$ are obtained when we consider the conditional distribution of $n_{ij}(t)$ given $n_i(t - 1)$ as when the joint distribution of the $n_{ij}(1), n_{ij}(2), \ldots, n_{ij}(T)$ is used. Formally these estimates are the same as one would obtain if for each $i$ and $t$ one had $n_i(t - 1)$ observations on a multinomial distribution with probabilities $p_{ij}(t)$ and with resulting numbers $n_{ij}(t)$.
The estimates can be described in the following way: Let the entries $n_{ij}(t)$ for given $t$ be entered in a two-way $m \times m$ table. The estimate of $p_{ij}(t)$ is the $i, j$th entry in the table divided by the sum of the entries in the $i$th row. In order to estimate $p_{ij}$ for a stationary chain, add the corresponding entries in the two-way tables for $t = 1, \ldots, T$, obtaining a two-way table with entries $n_{ij} = \sum_t n_{ij}(t)$. The estimate of $p_{ij}$ is the $i, j$th entry of the table of $n_{ij}$’s divided by the sum of the entries in the $i$th row.
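The table-based description of the estimates translates directly into code. The following Python sketch is illustrative only: the function names and the toy tables are hypothetical, and a row with no observations is reported as undefined (NaN), a choice made for this example rather than one discussed in the text.

```python
def row_normalize(table):
    """Divide each entry of a two-way table by its row sum."""
    estimates = []
    for row in table:
        total = sum(row)
        if total:
            estimates.append([x / total for x in row])
        else:
            # No individuals were in this state, so the estimate is undefined.
            estimates.append([float("nan")] * len(row))
    return estimates

def mle_nonstationary(counts):
    """One table of estimates per time interval: n_ij(t) / n_i(t - 1)."""
    return [row_normalize(table) for table in counts]

def mle_stationary(counts):
    """Pool the tables over t, then normalize rows: n_ij / n_i*."""
    m = len(counts[0])
    pooled = [[sum(table[i][j] for table in counts) for j in range(m)]
              for i in range(m)]
    return row_normalize(pooled)

# Hypothetical counts n_ij(t) for m = 2 states and T = 2 intervals.
counts = [[[1, 1], [0, 1]],
          [[0, 1], [1, 1]]]
p_hat_t = mle_nonstationary(counts)
p_hat = mle_stationary(counts)
```

Pooling before normalizing corresponds to (2.8); normalizing each table separately corresponds to (2.9).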

The covariance structure of the maximum likelihood estimates presented in this section will be given further on.


2.3. Asymptotic behavior of $n_{ij}(t)$. To find the asymptotic behavior of the $\hat p_{ij}$, first consider the $n_{ij}(t)$. We shall assume that $n_k(0)/\sum n_j(0) \to \eta_k$ $(\eta_k > 0,\ \sum \eta_k = 1)$ as $\sum n_j(0) \to \infty$. For each $i(0)$, the set $n_{i(0)i(1)\cdots i(T)}$ are simply multinomial variables with sample size $n_{i(0)}(0)$ and parameters $p_{i(0)i(1)}\,p_{i(1)i(2)} \cdots p_{i(T-1)i(T)}$ and hence are asymptotically normally distributed as the sample size increases. The $n_{ij}(t)$ are linear combinations of these multinomial variables, and hence are also asymptotically normally distributed.

Let $P = (p_{ij})$ and let $p_{ij}^{(t)}$ be the elements of the matrix $P^t$. Then $p_{ij}^{(t)}$ is the probability of state $j$ at time $t$ given state $i$ at time 0. Let $n_{k;ij}(t)$ be the number of sequences including state $k$ at time 0, $i$ at time $t - 1$ and $j$ at time $t$. Then we seek the low order moments of

(2.10)  $n_{ij}(t) = \sum_{k=1}^{m} n_{k;ij}(t).$

The probability associated with $n_{k;ij}(t)$ is $p_{ki}^{(t-1)} p_{ij}$ with a sample size of $n_k(0)$. Thus

(2.11)  $E\,n_{k;ij}(t) = n_k(0)\,p_{ki}^{(t-1)} p_{ij},$

(2.12)  $\mathrm{Var}\{n_{k;ij}(t)\} = n_k(0)\,p_{ki}^{(t-1)} p_{ij}\,[1 - p_{ki}^{(t-1)} p_{ij}],$

(2.13)  $\mathrm{Cov}\{n_{k;ij}(t), n_{k;gh}(t)\} = -n_k(0)\,p_{ki}^{(t-1)} p_{ij}\,p_{kg}^{(t-1)} p_{gh}, \quad (i, j) \ne (g, h),$

since the set of $n_{k;ij}(t)$ follows a multinomial distribution. Covariances between other variables were given in [1].
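The moments (2.11) and (2.12) only require the $(t - 1)$-step transition probabilities, which are entries of a matrix power. The sketch below illustrates this in plain Python; the helper names and the example matrix are hypothetical, and states are coded from 0.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def mat_power(p, k):
    """k-th power of a square matrix, with the identity for k = 0."""
    m = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(k):
        result = mat_mul(result, p)
    return result

def moments_nkij(nk0, P, k, i, j, t):
    """Mean and variance of n_{k;ij}(t) following (2.11) and (2.12), for t >= 1."""
    prob = mat_power(P, t - 1)[k][i] * P[i][j]  # p_ki^(t-1) * p_ij
    return nk0 * prob, nk0 * prob * (1.0 - prob)

# Hypothetical two-state chain with 100 individuals starting in state 0.
P = [[0.5, 0.5], [0.2, 0.8]]
mean, var = moments_nkij(100, P, k=0, i=1, j=1, t=2)
```

Summing the means over the initial states $k$ gives the expected value of $n_{ij}(t)$ in (2.10).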

Let us now examine moments of $n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}$, where $n_{k;i}(t - 1) = \sum_j n_{k;ij}(t)$; they will be needed in obtaining the asymptotic theory for test procedures. The conditional distribution of $n_{k;ij}(t)$ given $n_{k;i}(t - 1)$ is easily seen to be multinomial, with the probabilities $p_{ij}$. Thus,

(2.14)  $E\{n_{k;ij}(t) \mid n_{k;i}(t - 1)\} = p_{ij}\,n_{k;i}(t - 1),$

(2.15)  $E[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}] = E\,E\{[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}] \mid n_{k;i}(t - 1)\} = 0.$
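The variance of this quantity is

(2.16)  $E[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}]^2 = E\,E\{[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}]^2 \mid n_{k;i}(t - 1)\}$

$= E\,n_{k;i}(t - 1)\,p_{ij}(1 - p_{ij}) = n_k(0)\,p_{ki}^{(t-1)} p_{ij}(1 - p_{ij}).$

The covariances of pairs of such quantities are

(2.17)  $E[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}][n_{k;ih}(t) - n_{k;i}(t - 1)p_{ih}]$

$= E\,E\{[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}][n_{k;ih}(t) - n_{k;i}(t - 1)p_{ih}] \mid n_{k;i}(t - 1)\}$

$= E\{-n_{k;i}(t - 1)\,p_{ij} p_{ih}\} = -n_k(0)\,p_{ki}^{(t-1)} p_{ij} p_{ih}, \quad j \ne h,$

(2.18)  $E[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}][n_{k;gh}(t) - n_{k;g}(t - 1)p_{gh}]$

$= E\,E\{[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}][n_{k;gh}(t) - n_{k;g}(t - 1)p_{gh}] \mid n_{k;i}(t - 1), n_{k;g}(t - 1)\} = 0, \quad i \ne g,$

(2.19)  $E[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}][n_{k;gh}(t + r) - n_{k;g}(t + r - 1)p_{gh}]$

$= E\,E\{[n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}][n_{k;gh}(t + r) - n_{k;g}(t + r - 1)p_{gh}] \mid n_{k;g}(t + r - 1), n_{k;i}(t - 1), n_{k;ij}(t)\} = 0, \quad r > 0.$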

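To summarize, the random variables $n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}$ for $j = 1, \ldots, m$ have means 0 and variances and covariances of multinomial variables with probabilities $p_{ij}$ and sample size $n_k(0)p_{ki}^{(t-1)}$. The variables $n_{k;ij}(t) - n_{k;i}(t - 1)p_{ij}$ and $n_{k;gh}(s) - n_{k;g}(s - 1)p_{gh}$ are uncorrelated if $t \ne s$ or $i \ne g$.

Since we assume $n_k(0)$ fixed, $n_{k;ij}(t)$ and $n_{l;gh}(s)$ are independent if $k \ne l$. Thus

(2.20)  $E[n_{ij}(t) - n_i(t - 1)p_{ij}] = 0,$

(2.21)  $E[n_{ij}(t) - n_i(t - 1)p_{ij}]^2 = \sum_{k=1}^{m} n_k(0)\,p_{ki}^{(t-1)} p_{ij}(1 - p_{ij}),$

(2.22)  $E[n_{ij}(t) - n_i(t - 1)p_{ij}][n_{ih}(t) - n_i(t - 1)p_{ih}] = -\sum_{k=1}^{m} n_k(0)\,p_{ki}^{(t-1)} p_{ij} p_{ih}, \quad j \ne h,$

(2.23)  $E[n_{ij}(t) - n_i(t - 1)p_{ij}][n_{gh}(s) - n_g(s - 1)p_{gh}] = 0, \quad t \ne s \text{ or } i \ne g.$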
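2.4. The asymptotic distribution of the estimates. It will now be shown that when $n \to \infty$,

(2.24)  $\sqrt{n}\,(\hat p_{ij} - p_{ij}) = \sqrt{n}\,\Bigl[\dfrac{\sum_{t=1}^{T} n_{ij}(t)}{\sum_{t=1}^{T} n_i(t - 1)} - p_{ij}\Bigr] = \sqrt{n}\,\dfrac{\sum_{t=1}^{T} [n_{ij}(t) - p_{ij} n_i(t - 1)]}{\sum_{t=1}^{T} n_i(t - 1)} = \dfrac{n^{-1/2} \sum_{t=1}^{T} [n_{ij}(t) - p_{ij} n_i(t - 1)]}{n^{-1} \sum_{t=1}^{T} n_i(t - 1)}$

has a limiting normal distribution, and the means, variances and covariances of the limiting distribution will be found. Because $n_{k;ij}(t)$ is a multinomial variable, we know that

(2.25)  $n_{k;ij}(t)/n = [n_{k;ij}(t)/n_k(0)]\,[n_k(0)/n]$

converges in probability to its expected value when $n_k(0)/n \to \eta_k$. Thus

(2.26)  $\operatorname{plim}_{n \to \infty} \dfrac{1}{n} \sum_{t=1}^{T} n_i(t - 1) = \lim_{n \to \infty} \dfrac{1}{n}\,E \sum_{t=1}^{T} n_i(t - 1) = \sum_{k=1}^{m} \eta_k \sum_{t=1}^{T} p_{ki}^{(t-1)}.$

Therefore $n^{1/2}(\hat p_{ij} - p_{ij})$ has the same limit distribution as

(2.27)  $\dfrac{\sum_{t=1}^{T} [n_{ij}(t) - p_{ij} n_i(t - 1)]/n^{1/2}}{\sum_{k=1}^{m} \sum_{t=1}^{T} \eta_k\,p_{ki}^{(t-1)}}$

(see p. 254 in [6]).

From the conclusions in Section 2.3, the numerator of (2.27) has mean 0 and variance

(2.28)  $E\Bigl\{\sum_{t=1}^{T} [n_{ij}(t) - p_{ij} n_i(t - 1)]\Bigr\}^2 \Big/ n = \sum_{k=1}^{m} \sum_{t=1}^{T} n_k(0)\,p_{ki}^{(t-1)} p_{ij}(1 - p_{ij})/n.$

The covariance between two different numerators is

(2.29)  $E\Bigl\{\sum_{t=1}^{T} [n_{ij}(t) - p_{ij} n_i(t - 1)] \sum_{t=1}^{T} [n_{gh}(t) - p_{gh} n_g(t - 1)]\Bigr\} \Big/ n = -\delta_{ig} \sum_{k=1}^{m} \sum_{t=1}^{T} n_k(0)\,p_{ki}^{(t-1)} p_{ij} p_{gh}/n,$

where $\delta_{ig} = 0$ if $i \ne g$ and $\delta_{ii} = 1$.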
Let

(2.30)  $\sum_{k=1}^{m}\sum_{t=1}^{T} \eta_k\, p_{ki}^{[t-1]} = \phi_i.$

Then the limiting variance of the numerator of (2.27) is $\phi_i\, p_{ij}(1 - p_{ij})$, and the limiting covariance between two different numerators is $-\delta_{ig}\,\phi_i\, p_{ij}\, p_{gh}$. Because the numerators of (2.27) are linear combinations of normalized multinomial variables, with fixed probabilities and increasing sample size, they have a limiting normal distribution, and the variances and covariances of this limit distribution are the limits of the respective variances and covariances (see, e.g., Theorem 2, p. 5 in [4]).

Since $n^{1/2}(\hat p_{ij} - p_{ij})$ has the same limit distribution as (2.27), the variables $n^{1/2}(\hat p_{ij} - p_{ij})$ have a limiting joint normal distribution with means 0, variances $p_{ij}(1 - p_{ij})/\phi_i$, and covariances $-\delta_{ig}\, p_{ij}\, p_{gh}/\phi_i$. The variables $(n\phi_i)^{1/2}(\hat p_{ij} - p_{ij})$ have a limiting joint normal distribution with means 0, variances $p_{ij}(1 - p_{ij})$, and covariances $-\delta_{ig}\, p_{ij}\, p_{gh}$. Also, the set $(n_i^*)^{1/2}(\hat p_{ij} - p_{ij})$ has a limiting joint normal distribution with means 0, variances $p_{ij}(1 - p_{ij})$, and covariances $-\delta_{ig}\, p_{ij}\, p_{gh}$, where $n_i^* = \sum_{t=0}^{T-1} n_i(t)$.

In other terms, the set $(n\phi_i)^{1/2}(\hat p_{ij} - p_{ij})$ for a given $i$ has the same limiting distribution as the estimates of multinomial probabilities $p_{ij}$ with sample size $n\phi_i$, which is the expected total number of observations $n_i^*$ in the $i$th state for $t = 0, \ldots, T - 1$. The variables $(n\phi_i)^{1/2}(\hat p_{ij} - p_{ij})$ for $m$ different values of $i$ ($i = 1, 2, \ldots, m$) are asymptotically independent (i.e., the limiting joint distribution factors), and hence have the same limiting joint distribution as estimates of multinomial probabilities $p_{ij}$ from $m$ independent samples with sample sizes $n\phi_i$ ($i = 1, 2, \ldots, m$). Hypotheses about the $p_{ij}$ can therefore often be reformulated in terms of $m$ independent samples consisting of multinomial trials.

We shall also use the fact that the variables $\hat p_{ij}(t) = n_{ij}(t)/n_i(t-1)$ for a given $i$ and $t$ have the same asymptotic distribution as the estimates of multinomial probabilities with sample sizes $E\,n_i(t-1)$, and the variables $\hat p_{ij}(t)$ for two different values of $i$ or two different values of $t$ are asymptotically independent. This can be proved by methods similar to those used earlier in this section. Hence, in this case the hypotheses can be reformulated in terms of samples consisting of multinomial trials, and standard test procedures may then be applied.

3. Tests of hypotheses and confidence regions.

3.1. Tests of hypotheses about specific probabilities and confidence regions. On the basis of the asymptotic distribution theory in the preceding section, we can derive certain methods of statistical inference. Here we shall assume that every $p_{ij} > 0$.

First we consider testing the hypothesis that certain transition probabilities $p_{ij}$ have specified values $p_{ij}^0$. We make use of the fact that under the null hypothesis the $(n_i^*)^{1/2}(\hat p_{ij} - p_{ij}^0)$ have a limiting normal distribution with means zero, and variances and covariances depending on $p_{ij}^0$ in the same way as obtains for multinomial estimates. We can use standard asymptotic theory for multinomial or normal distributions to test a hypothesis about one or more $p_{ij}$, or determine a confidence region for one or more $p_{ij}$.

As a specific example consider testing the hypothesis that $p_{ij} = p_{ij}^0$, $j = 1, \ldots, m$, for a given $i$. Under the null hypothesis,

(3.1)  $\sum_{j=1}^{m} n_i^*\, \dfrac{(\hat p_{ij} - p_{ij}^0)^2}{p_{ij}^0}$

has an asymptotic $\chi^2$-distribution with $m - 1$ degrees of freedom (according to the usual asymptotic theory of multinomial variables). Thus the critical region of one test of this hypothesis at significance level $\alpha$ consists of the set $\hat p_{ij}$ for which (3.1) is greater than the $\alpha$ significance point of the $\chi^2$-distribution with $m - 1$ degrees of freedom. A confidence region of confidence coefficient $\alpha$ consists of the set $p_{ij}^0$ for which (3.1) is less than the $\alpha$ significance point. (The $p_{ij}^0$ in the denominator can be replaced by $\hat p_{ij}$.) Since the variables $(n_i^*)^{1/2}(\hat p_{ij} - p_{ij}^0)$ for different $i$ are asymptotically independent, the forms (3.1) for different $i$ are asymptotically independent, and hence can be added to obtain other $\chi^2$-variables. For instance, a test for all $p_{ij}$ ($i, j = 1, 2, \ldots, m$) can be obtained by adding (3.1) over all $i$, resulting in a $\chi^2$-variable with $m(m - 1)$ degrees of freedom.

The use of the $\chi^2$-test of goodness of fit is discussed in [5]. We believe that there is as good reason for adopting the tests, which are analogous to $\chi^2$-tests of goodness of fit, described in this section as in the situation from which they were borrowed (see [5]).
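
As an illustration only, the statistic (3.1) for a single row $i$, namely $\sum_j n_i^*(\hat p_{ij} - p_{ij}^0)^2/p_{ij}^0$ with $m - 1$ degrees of freedom, can be sketched as follows. The function name `chi2_specified_row` and the example counts are hypothetical; the input is assumed to be the pooled transition counts $n_{ij}$ out of state $i$, and every $p_{ij}^0$ is assumed positive.

```python
def chi2_specified_row(row_counts, p0):
    """Statistic (3.1) for one state i.

    row_counts[j] -- pooled number of transitions i -> j (n_ij), hypothetical data
    p0[j]         -- hypothesized transition probability p0_ij (> 0, summing to 1)
    Returns (statistic, degrees of freedom).
    """
    n_star = sum(row_counts)          # n_i^*: total transitions observed out of state i
    if n_star == 0:
        raise ValueError("no transitions observed out of this state")
    stat = 0.0
    for n_ij, p0_ij in zip(row_counts, p0):
        p_hat = n_ij / n_star         # maximum likelihood estimate of p_ij
        stat += n_star * (p_hat - p0_ij) ** 2 / p0_ij
    return stat, len(p0) - 1


# Hypothetical example: 100 transitions out of state i, three states.
stat, df = chi2_specified_row([30, 50, 20], [0.25, 0.5, 0.25])
```

The returned statistic would then be compared with the chosen significance point of the $\chi^2$-distribution with `df` degrees of freedom; adding the statistics over all rows gives the test with $m(m-1)$ degrees of freedom.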

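3.2. Testing the hypothesis that the transition probabilities are constant. In the stationary Markov chain, $p_{ij}$ is the probability that an individual in state $i$ at time $t - 1$ moves to state $j$ at $t$. A general alternative to this assumption is that the transition probability depends on $t$; let us say it is $p_{ij}(t)$. We test the null hypothesis $H: p_{ij}(t) = p_{ij}$ ($t = 1, \ldots, T$). Under the alternate hypothesis, the estimates of the transition probabilities for time $t$ are

(3.2)  $\hat p_{ij}(t) = \dfrac{n_{ij}(t)}{n_i(t-1)}.$

The likelihood function maximized under the null hypothesis is

(3.3)  $\prod_{t=1}^{T} \prod_{i,j} \hat p_{ij}^{\,n_{ij}(t)}.$

The likelihood function maximized under the alternative is

(3.4)  $\prod_{t=1}^{T} \prod_{i,j} \hat p_{ij}(t)^{\,n_{ij}(t)}.$

The ratio is the likelihood ratio criterion

(3.5)  $\lambda = \prod_{t=1}^{T} \prod_{i,j} \left[\dfrac{\hat p_{ij}}{\hat p_{ij}(t)}\right]^{n_{ij}(t)}.$

A slight extension of a theorem of Cramér [6] or of Neyman [11] shows that $-2 \log \lambda$ is distributed as $\chi^2$ with $(T - 1)[m(m - 1)]$ degrees of freedom when the null hypothesis is true.

The likelihood ratio (3.5) resembles likelihood ratios obtained for standard tests of homogeneity in contingency tables (see [6], p. 445). We shall now develop further this similarity to usual procedures for contingency tables. A proof that the results obtained by this contingency table approach are asymptotically equivalent to those presented earlier in this section is given in Section 6.

For a given $i$, the set $\hat p_{ij}(t)$ has the same asymptotic distribution as the estimates of multinomial probabilities for $T$ independent samples. An $m \times T$ table, which has the same formal appearance as a contingency table, can be used to represent the joint estimates $\hat p_{ij}(t)$ for a given $i$ and for $j = 1, 2, \ldots, m$ and $t = 1, 2, \ldots, T$:

    t \ j    1                 2                 ...   m
    1        $\hat p_{i1}(1)$   $\hat p_{i2}(1)$   ...   $\hat p_{im}(1)$
    2        $\hat p_{i1}(2)$   $\hat p_{i2}(2)$   ...   $\hat p_{im}(2)$
    ...
    T        $\hat p_{i1}(T)$   $\hat p_{i2}(T)$   ...   $\hat p_{im}(T)$

The null hypothesis is that $p_{ij}(t) = p_{ij}$ for $t = 1, 2, \ldots, T$, and the $\chi^2$-test of homogeneity seems appropriate. To test this hypothesis, we calculate

(3.6)  $\chi_i^2 = \sum_{t,j} n_i(t-1)\,[\hat p_{ij}(t) - \hat p_{ij}]^2 / \hat p_{ij};$

if the null hypothesis is true, $\chi_i^2$ has the usual limiting distribution with $(m - 1)(T - 1)$ degrees of freedom.

Another test of the hypothesis of homogeneity for $T$ independent samples from multinomial trials can be obtained by use of the likelihood ratio criterion; that is, in order to test this hypothesis for the data given in the $m \times T$ table, calculate

(3.7)  $\lambda_i = \prod_{t,j} \left[\hat p_{ij} / \hat p_{ij}(t)\right]^{n_{ij}(t)},$

which is formally similar to the likelihood ratio criterion. The asymptotic distribution of $-2 \log \lambda_i$ is $\chi^2$ with $(m - 1)(T - 1)$ degrees of freedom.
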
The preceding remarks relating to the contingency table approach dealt with a given value of $i$. Hence, the hypothesis can be tested separately for each value of $i$.

Let us now consider the joint hypothesis that $p_{ij}(t) = p_{ij}$ for all $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, m$; $t = 1, \ldots, T$. A test of this joint null hypothesis follows directly from the fact that the random variables $\hat p_{ij}(t)$ and $\hat p_{ij}$ for two different values of $i$ are asymptotically independent. Hence, under the null hypothesis, the set of $\chi_i^2$ calculated for each $i = 1, 2, \ldots, m$ are asymptotically independent, and the sum

(3.8)  $\chi^2 = \sum_{i=1}^{m} \chi_i^2 = \sum_{i} \sum_{t,j} n_i(t-1)\,[\hat p_{ij}(t) - \hat p_{ij}]^2 / \hat p_{ij}$

has the usual limiting distribution with $m(m - 1)(T - 1)$ degrees of freedom.

Similarly, the test criterion based on (3.5) can be written

(3.9)  $\sum_{i=1}^{m} -2 \log \lambda_i = -2 \log \lambda.$
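
The two stationarity criteria of this section, the sum of $n_i(t-1)[\hat p_{ij}(t) - \hat p_{ij}]^2/\hat p_{ij}$ over $i$, $j$, $t$ and $-2\log\lambda = 2\sum n_{ij}(t)\log[\hat p_{ij}(t)/\hat p_{ij}]$, can be computed side by side. The following is a minimal sketch, not the authors' procedure: the function name `stationarity_tests` and the layout `counts[t][i][j]` (transitions from $i$ to $j$ in the interval ending at time $t+1$) are assumptions made for the example, and cells with zero counts are simply skipped in the logarithmic sum.

```python
import math


def stationarity_tests(counts):
    """Chi-square (3.8) and likelihood ratio (3.9) statistics for constant p_ij.

    counts[t][i][j] -- hypothetical layout: number of individuals moving from
                       state i to state j during the (t+1)-th time interval.
    Returns (chi2, minus_two_log_lambda, degrees_of_freedom).
    """
    T = len(counts)
    m = len(counts[0])
    chi2 = 0.0
    lr = 0.0
    for i in range(m):
        pooled = [sum(counts[t][i][j] for t in range(T)) for j in range(m)]
        n_star = sum(pooled)                      # n_i^*
        if n_star == 0:
            continue                              # state i never occupied
        p_hat = [c / n_star for c in pooled]      # pooled estimate of p_ij
        for t in range(T):
            n_prev = sum(counts[t][i])            # n_i(t-1)
            if n_prev == 0:
                continue
            for j in range(m):
                if p_hat[j] == 0.0:
                    continue                      # no i -> j transition at any time
                n_ijt = counts[t][i][j]
                p_hat_t = n_ijt / n_prev          # estimate (3.2) for this interval
                chi2 += n_prev * (p_hat_t - p_hat[j]) ** 2 / p_hat[j]
                if n_ijt > 0:
                    lr += 2.0 * n_ijt * math.log(p_hat_t / p_hat[j])
    return chi2, lr, m * (m - 1) * (T - 1)


# Hypothetical two-state data observed over two intervals.
same = [[[8, 2], [3, 7]], [[8, 2], [3, 7]]]
shifted = [[[8, 2], [3, 7]], [[5, 5], [6, 4]]]
chi2_same, lr_same, df = stationarity_tests(same)
chi2_shift, lr_shift, _ = stationarity_tests(shifted)
```

When the per-interval estimates coincide, both statistics are zero; when they differ, both are positive and are referred to the $\chi^2$-distribution with $m(m-1)(T-1)$ degrees of freedom.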

3.3. Test of the hypothesis that the chain is of a given order. Consider first a second-order Markov chain. Given that an individual is in state $i$ at $t - 2$ and in $j$ at $t - 1$, let $p_{ijk}(t)$ ($i, j, k = 1, \ldots, m$; $t = 2, 3, \ldots, T$) be the probability of being in state $k$ at $t$. When the second-order chain is stationary, $p_{ijk}(t) = p_{ijk}$ for $t = 2, \ldots, T$. A first-order stationary chain is a special second-order chain, one for which $p_{ijk}$ does not depend on $i$. On the other hand, as is well known, the second-order chain can be represented as a more complicated first-order chain (see, e.g., [2]). To do this, let the pair of successive states $i$ and $j$ define a composite state $(i, j)$. Then the probability of the composite state $(j, k)$ at $t$ given the composite state $(i, j)$ at $t - 1$ is $p_{ijk}(t)$. Of course, the probability of state $(h, k)$, $h \ne j$, given $(i, j)$, is zero. The composite states are easily seen to form a chain with $m^2$ states and with certain transition probabilities equal to 0. This representation is useful because some of the results for first-order Markov chains can be carried over from Section 2.

Now let $n_{ijk}(t)$ be the number of individuals in state $i$ at $t - 2$, in $j$ at $t - 1$, and in $k$ at $t$, and let $n_{ij}(t - 1) = \sum_k n_{ijk}(t)$. We assume in this section that the $n_i(0)$ and $n_{ij}(1)$ are nonrandom, extending the idea of the earlier sections where the $n_i(0)$ were nonrandom and the $n_{ij}(1)$ were random variables. The $n_{ijk}(t)$ ($i, j, k = 1, \ldots, m$; $t = 2, \ldots, T$) is a set of sufficient statistics for the different sequences of states. The conditional distribution of $n_{ijk}(t)$, given $n_{ij}(t - 1)$, is

(3.10)  $\dfrac{n_{ij}(t-1)!}{\prod_{k=1}^{m} n_{ijk}(t)!} \prod_{k=1}^{m} p_{ijk}^{\,n_{ijk}(t)}.$

(When the transition probabilities need not be the same for each time interval, the symbols $p_{ijk}$ should, of course, be replaced by the appropriate $p_{ijk}(t)$ throughout.) The joint distribution of $n_{ijk}(t)$ for $i, j, k = 1, \ldots, m$ and $t = 2, \ldots, T$, when the set of $n_{ij}(1)$ is given, is the product of (3.10) over $i$, $j$ and $t$.

For chains with stationary transition probabilities, a stronger result concerning sufficiency can be obtained as it was for first-order chains; namely, the numbers $n_{ijk} = \sum_{t=2}^{T} n_{ijk}(t)$ form a set of sufficient statistics. The maximum likelihood estimate of $p_{ijk}$ for stationary chains is

(3.11)  $\hat p_{ijk} = n_{ijk} \Big/ \sum_{l=1}^{m} n_{ijl} = \sum_{t=2}^{T} n_{ijk}(t) \Big/ \sum_{t=2}^{T} n_{ij}(t-1).$


Now let us consider testing the null hypothesis that the chain is first-order against the alternative that it is second-order. The null hypothesis is that $p_{1jk} = p_{2jk} = \cdots = p_{mjk} = p_{jk}$, say, for $j, k = 1, \ldots, m$. The likelihood ratio criterion for testing this hypothesis is*

(3.12)  $\lambda = \prod_{i,j,k=1}^{m} (\hat p_{jk} / \hat p_{ijk})^{n_{ijk}},$

where

(3.13)  $\hat p_{jk} = \sum_{i=1}^{m} n_{ijk} \Big/ \sum_{i=1}^{m}\sum_{l=1}^{m} n_{ijl} = \sum_{t=2}^{T} n_{jk}(t) \Big/ \sum_{t=1}^{T-1} n_j(t)$

is the maximum likelihood estimate of $p_{jk}$. We see here that $\hat p_{jk}$ differs somewhat from (2.8). This difference is due to the fact that in the earlier section the $n_{ij}(1)$ were random variables while in this section we assumed that the $n_{ij}(1)$ were nonrandom. Under the null hypothesis, $-2 \log \lambda$ has an asymptotic $\chi^2$-distribution with $m^2(m - 1) - m(m - 1) = m(m - 1)^2$ degrees of freedom.

We observe that the likelihood ratio (3.12) resembles likelihood ratios obtained for problems relating to contingency tables. We shall now develop further this similarity to standard procedures for contingency tables.

For a given $j$, the $n^{1/2}(\hat p_{ijk} - p_{ijk})$ have the same asymptotic distribution as the estimates of multinomial probabilities for $m$ independent samples ($i = 1, 2, \ldots, m$). An $m \times m$ table, which has the same formal appearance as a contingency table, can be used to represent the estimates $\hat p_{ijk}$ for a given $j$ and for $i, k = 1, 2, \ldots, m$. The null hypothesis is that $p_{ijk} = p_{jk}$ for $i = 1, 2, \ldots, m$, and the $\chi^2$-test of homogeneity seems appropriate. To test this hypothesis, calculate

(3.14)  $\chi_j^2 = \sum_{i,k} n_{ij}^*\,(\hat p_{ijk} - \hat p_{jk})^2 / \hat p_{jk},$

where

(3.15)  $n_{ij}^* = \sum_{k} n_{ijk} = \sum_{t=2}^{T} n_{ij}(t-1) = \sum_{t=1}^{T-1} n_{ij}(t).$

If the hypothesis is true, $\chi_j^2$ has the usual limiting distribution with $(m - 1)^2$ degrees of freedom.

* The criterion (3.12) was written incorrectly in (6.35) of [1] and (4.10) of [2].
In continued analogy with Section 3.2, another test of the hypothesis of homogeneity for $m$ independent samples from multinomial trials can be obtained by use of the likelihood ratio criterion. We calculate

(3.16)  $\lambda_j = \prod_{i,k} (\hat p_{jk} / \hat p_{ijk})^{n_{ijk}},$

which is formally similar to the likelihood ratio criterion. The asymptotic distribution of $-2 \log \lambda_j$ is $\chi^2$ with $(m - 1)^2$ degrees of freedom.

The preceding remarks relating to the contingency table approach dealt with a given value of $j$. Hence, the hypothesis can be tested separately for each value of $j$.

Let us now consider the joint hypothesis that $p_{ijk} = p_{jk}$ for all $i, j, k = 1, 2, \ldots, m$. A test of this joint hypothesis can be obtained by computing the sum

(3.17)  $\chi^2 = \sum_{j=1}^{m} \chi_j^2 = \sum_{i,j,k} n_{ij}^*\,(\hat p_{ijk} - \hat p_{jk})^2 / \hat p_{jk},$

which has the usual limiting distribution with $m(m - 1)^2$ degrees of freedom. Similarly the test criterion based on (3.12) can be written

(3.18)  $\sum_{j=1}^{m} -2 \log \lambda_j = -2 \log \lambda = 2 \sum_{i,j,k} n_{ijk} \log [\hat p_{ijk} / \hat p_{jk}] = 2 \sum_{i,j,k} n_{ijk}\,[\log \hat p_{ijk} - \log \hat p_{jk}].$
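
A rough sketch of the first-order versus second-order test may help fix the notation. It takes the triple counts $n_{ijk}$, forms $\hat p_{ijk} = n_{ijk}/n_{ij}^*$ and the pooled $\hat p_{jk} = \sum_i n_{ijk}/\sum_i n_{ij}^*$, and returns both the $\chi^2$ sum and $-2\log\lambda$ with $m(m-1)^2$ degrees of freedom. The helper names `triple_counts` and `first_vs_second_order`, and the coding of states as integers `0..m-1`, are assumptions of this example rather than anything prescribed in the text.

```python
import math


def triple_counts(sequences, m):
    """Count n_ijk: occurrences of state i, then j, then k at consecutive times.

    sequences -- hypothetical input: one list of integer states (0..m-1) per individual.
    """
    n = [[[0] * m for _ in range(m)] for _ in range(m)]
    for seq in sequences:
        for i, j, k in zip(seq, seq[1:], seq[2:]):
            n[i][j][k] += 1
    return n


def first_vs_second_order(n):
    """Chi-square (3.17) and -2 log lambda (3.18) from counts n[i][j][k]."""
    m = len(n)
    chi2 = 0.0
    lr = 0.0
    for j in range(m):
        n_jk = [sum(n[i][j][k] for i in range(m)) for k in range(m)]
        n_j = sum(n_jk)
        if n_j == 0:
            continue                               # state j never followed by anything
        p_jk = [c / n_j for c in n_jk]             # pooled first-order estimate (3.13)
        for i in range(m):
            n_ij_star = sum(n[i][j])               # n_ij^* of (3.15)
            if n_ij_star == 0:
                continue
            for k in range(m):
                if p_jk[k] == 0.0:
                    continue
                p_ijk = n[i][j][k] / n_ij_star     # second-order estimate (3.11)
                chi2 += n_ij_star * (p_ijk - p_jk[k]) ** 2 / p_jk[k]
                if n[i][j][k] > 0:
                    lr += 2.0 * n[i][j][k] * math.log(p_ijk / p_jk[k])
    return chi2, lr, m * (m - 1) ** 2


# Hypothetical counts in which the distribution of k does not depend on i.
n_first_order = [[[4, 6], [5, 5]], [[4, 6], [5, 5]]]
chi2_0, lr_0, df_order = first_vs_second_order(n_first_order)
```

In the example the rows for different $i$ are identical, so both statistics vanish, as they should when the data look exactly first-order.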
FT 
The preceding remarks can be directly generalized for a chain of order $r$. Let $p_{ij\cdots kl}$ ($i, j, \ldots, k, l = 1, 2, \ldots, m$) denote the transition probability of state $l$ at time $t$, given state $k$ at time $t - 1$, ..., state $j$ at time $t - r + 1$, and state $i$ at time $t - r$ ($t = r, r + 1, \ldots, T$). We shall test the null hypothesis that the process is a chain of order $r - 1$ (that is, $p_{ij\cdots kl} = p_{j\cdots kl}$ for $i = 1, 2, \ldots, m$) against the alternate hypothesis that it is not an $(r - 1)$-order but an $r$-order chain.

Let $n_{ij\cdots kl}(t)$ denote the observed frequency of the states $i, j, \ldots, k, l$ at the respective times $t - r, t - r + 1, \ldots, t - 1, t$, and let $n_{ij\cdots k}(t - 1) = \sum_{l=1}^{m} n_{ij\cdots kl}(t)$. We assume here that the $n_{ij\cdots k}(r - 1)$ are nonrandom. The maximum likelihood estimate of $p_{ij\cdots kl}$ is

(3.19)  $\hat p_{ij\cdots kl} = n_{ij\cdots kl} / n_{ij\cdots k}^*,$

where $n_{ij\cdots kl} = \sum_{t=r}^{T} n_{ij\cdots kl}(t)$ and

(3.20)  $n_{ij\cdots k}^* = \sum_{l=1}^{m} n_{ij\cdots kl} = \sum_{t=r}^{T} n_{ij\cdots k}(t - 1) = \sum_{t=r-1}^{T-1} n_{ij\cdots k}(t).$

For a given set $j, \ldots, k$, the set $\hat p_{ij\cdots kl}$ will have the same asymptotic distribution as estimates of multinomial probabilities for $m$ independent samples ($i = 1, 2, \ldots, m$), and may be represented by an $m \times m$ table. If the null hypothesis ($p_{ij\cdots kl} = p_{j\cdots kl}$ for $i = 1, 2, \ldots, m$) is true, then the $\chi^2$-test of homogeneity seems appropriate, and

(3.21)  $\chi_{j\cdots k}^2 = \sum_{i,l} n_{ij\cdots k}^*\,(\hat p_{ij\cdots kl} - \hat p_{j\cdots kl})^2 / \hat p_{j\cdots kl},$

where

(3.22)  $\hat p_{j\cdots kl} = \sum_{i=1}^{m} n_{ij\cdots kl} \Big/ \sum_{i=1}^{m} n_{ij\cdots k}^* = \sum_{t=r}^{T} n_{j\cdots kl}(t) \Big/ \sum_{t=r-1}^{T-1} n_{j\cdots k}(t),$

has the usual limiting distribution with $(m - 1)^2$ degrees of freedom. We see here that $\hat p_{j\cdots kl}$ differs somewhat from the maximum likelihood estimate for $p_{j\cdots kl}$ for an $(r - 1)$-order chain (viz., $\sum_{t=r-1}^{T} n_{j\cdots kl}(t) / \sum_{t=r-2}^{T-1} n_{j\cdots k}(t)$). This difference is due to the fact that the $n_{j\cdots kl}(r - 1)$, for an $(r - 1)$-order chain, are assumed to be multinomial random variables with parameters $p_{j\cdots kl}$, while in this paragraph we have assumed that the $n_{j\cdots kl}(r - 1)$ are fixed.

Since there are $m^{r-1}$ sets $j, \ldots, k$ ($j = 1, 2, \ldots, m$; ...; $k = 1, 2, \ldots, m$), the sum $\sum_{j,\ldots,k} \chi_{j\cdots k}^2$ will have the usual limiting distribution with $m^{r-1}(m - 1)^2$ degrees of freedom when the joint null hypothesis ($p_{ij\cdots kl} = p_{j\cdots kl}$ for $i = 1, 2, \ldots, m$ and all values from 1 to $m$ of $j, \ldots, k$) is true.

Another test of the null hypothesis can be obtained by use of the likelihood ratio criterion

(3.23)  $\lambda_{j\cdots k} = \prod_{i,l} (\hat p_{j\cdots kl} / \hat p_{ij\cdots kl})^{n_{ij\cdots kl}},$

where $-2 \log \lambda_{j\cdots k}$ is distributed asymptotically as $\chi^2$ with $(m - 1)^2$ degrees of freedom. Also,

(3.24)  $\sum_{j,\ldots,k} \{-2 \log \lambda_{j\cdots k}\} = 2 \sum_{i,j,\ldots,k,l} n_{ij\cdots kl} \log [\hat p_{ij\cdots kl} / \hat p_{j\cdots kl}]$

has a limiting $\chi^2$-distribution with $m^{r-1}(m - 1)^2$ degrees of freedom when the joint null hypothesis is true (see [10]).

These tests can be generalized to the null hypothesis that the process is of order $u$ against the alternate hypothesis that it is of order $r$ ($u < r$); the corresponding criterion $-2 \log \lambda$ is distributed asymptotically as $\chi^2$ with $[m^r - m^u](m - 1)$ degrees of freedom when the null hypothesis is true.

We have assumed that the transition probabilities are the same for each time interval, that is, stationary. It is possible to test the null hypothesis that the $r$th order chain has stationary transition probabilities using methods that are straightforward generalizations of the tests presented in the previous section for the special case of a first-order chain.


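3.4. Test of the hypothesis that several samples are from the same Markov chain of a given order. The general approach presented in the previous sections can be used to test the null hypothesis that $s$ ($s \ge 2$) samples are from the same $r$th order Markov chain; that is, that the $s$ processes are identical.

Let $\hat p_{ij\cdots kl}^{(h)} = n_{ij\cdots kl}^{(h)} / n_{ij\cdots k}^{*(h)}$ denote the maximum likelihood estimate of the $r$th order transition probability $p_{ij\cdots kl}^{(h)}$ for the process from which sample $h$ ($h = 1, 2, \ldots, s$) was obtained. We wish to test the null hypothesis that $p_{ij\cdots kl}^{(h)} = p_{ij\cdots kl}$ for $h = 1, 2, \ldots, s$. Using the approach presented herein, it follows that

(3.25)  $\chi_{ij\cdots k}^2 = \sum_{h,l} n_{ij\cdots k}^{*(h)}\,\big(\hat p_{ij\cdots kl}^{(h)} - \hat p_{ij\cdots kl}^{(\cdot)}\big)^2 / \hat p_{ij\cdots kl}^{(\cdot)},$

where $n_{ij\cdots kl}^{(\cdot)} = \sum_{h=1}^{s} n_{ij\cdots kl}^{(h)}$ and $\hat p_{ij\cdots kl}^{(\cdot)} = n_{ij\cdots kl}^{(\cdot)} / \sum_{l=1}^{m} n_{ij\cdots kl}^{(\cdot)}$, has the usual limiting distribution with $(s - 1)(m - 1)$ degrees of freedom. Also, $\sum_{i,j,\ldots,k} \chi_{ij\cdots k}^2$ has a limiting $\chi^2$-distribution with $m^r(s - 1)(m - 1)$ degrees of freedom.

When $s = 2$, $\chi_{ij\cdots k}^2$ can be rewritten in the form

(3.26)  $\chi_{ij\cdots k}^2 = \sum_{l} c_{ij\cdots k}^{-1}\,\big(\hat p_{ij\cdots kl}^{(1)} - \hat p_{ij\cdots kl}^{(2)}\big)^2 / \hat p_{ij\cdots kl}^{(\cdot)},$

where $\hat p_{ij\cdots kl}^{(\cdot)}$ is the estimate of $p_{ij\cdots kl}$ obtained by pooling the data in the two samples, and $c_{ij\cdots k} = (1/n_{ij\cdots k}^{*(1)}) + (1/n_{ij\cdots k}^{*(2)})$. Also, $\sum_{i,j,\ldots,k} \chi_{ij\cdots k}^2$ has the usual limiting distribution with $m^r(m - 1)$ degrees of freedom in the two-sample case.

Analogous results can also be obtained using the likelihood ratio criterion.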

3.5. A test involving two sets of states. In the case of panel studies, a person is usually asked several questions. We might classify each individual according to his opinion on two different questions. In an example in [2], one classification indicated whether a person saw the advertisement of a certain product and the other whether he bought the product in a certain time interval. Let the state be denoted $(\alpha, \beta)$, $\alpha = 1, \ldots, A$ and $\beta = 1, \ldots, B$, where $\alpha$ denotes the first opinion or class and $\beta$ the second. We assume that the sequence of states satisfies a first-order Markov chain with transition probabilities $p_{\alpha\beta,\mu\nu}$. We ask whether the sequence of changes in one classification is independent of that in the second. For example, if a person notices an advertisement, is he more likely to buy the product? The null hypothesis of independence of changes is

(3.27)  $p_{\alpha\beta,\mu\nu} = q_{\alpha\mu}\, r_{\beta\nu}, \quad \alpha, \mu = 1, \ldots, A; \ \beta, \nu = 1, \ldots, B,$

where $q_{\alpha\mu}$ is a transition probability for the first classification and $r_{\beta\nu}$ is for the second. We shall find the likelihood ratio criterion for testing this null hypothesis.

Let $n_{\alpha\beta,\mu\nu}(t)$ be the number of individuals in state $(\alpha, \beta)$ at $t - 1$ and $(\mu, \nu)$ at $t$. From the previous results, the maximum likelihood estimate of $p_{\alpha\beta,\mu\nu}$, when the null hypothesis is not assumed, is

(3.28)  $\hat p_{\alpha\beta,\mu\nu} = \dfrac{n_{\alpha\beta,\mu\nu}}{\sum_{g=1}^{A}\sum_{h=1}^{B} n_{\alpha\beta,gh}},$

where $n_{\alpha\beta,\mu\nu} = \sum_{t=1}^{T} n_{\alpha\beta,\mu\nu}(t)$. When the null hypothesis is assumed, the maximum likelihood estimate of $p_{\alpha\beta,\mu\nu}$ is $\hat q_{\alpha\mu}\,\hat r_{\beta\nu}$, where

(3.29)  $\hat q_{\alpha\mu} = \dfrac{\sum_{\beta,\nu} n_{\alpha\beta,\mu\nu}}{\sum_{g=1}^{A}\sum_{\beta,\nu} n_{\alpha\beta,g\nu}},$

(3.30)  $\hat r_{\beta\nu} = \dfrac{\sum_{\alpha,\mu} n_{\alpha\beta,\mu\nu}}{\sum_{h=1}^{B}\sum_{\alpha,\mu} n_{\alpha\beta,\mu h}}.$

The likelihood ratio criterion is

(3.31)  $\lambda = \prod_{t=1}^{T} \prod_{\alpha,\mu=1}^{A} \prod_{\beta,\nu=1}^{B} \left(\dfrac{\hat q_{\alpha\mu}\,\hat r_{\beta\nu}}{\hat p_{\alpha\beta,\mu\nu}}\right)^{n_{\alpha\beta,\mu\nu}(t)}.$

Under the null hypothesis, $-2 \log \lambda$ has an asymptotic $\chi^2$-distribution, and the number of degrees of freedom is $AB(AB - 1) - A(A - 1) - B(B - 1) = (A - 1)(B - 1)(AB + A + B)$.
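
The independence criterion can be sketched numerically as follows. The sketch assumes pooled counts stored as `n[a][b][u][v]` (individuals in composite state $(\alpha,\beta)$ at one time and $(\mu,\nu)$ at the next), estimates $q_{\alpha\mu}$ and $r_{\beta\nu}$ from the corresponding marginal counts, and returns $-2\log\lambda$ with $AB(AB-1) - A(A-1) - B(B-1)$ degrees of freedom. The function name `independence_test` and the data layout are hypothetical, and every state is assumed to have been observed at least once.

```python
import math


def independence_test(n):
    """-2 log lambda for independence of changes in two classifications.

    n[a][b][u][v] -- hypothetical layout: pooled count of moves (a, b) -> (u, v).
    Assumes every value of a and every value of b occurs at least once.
    Returns (minus_two_log_lambda, degrees_of_freedom).
    """
    A, B = len(n), len(n[0])
    # Marginal counts for the first classification (a -> u) and the second (b -> v).
    q_cnt = [[sum(n[a][b][u][v] for b in range(B) for v in range(B))
              for u in range(A)] for a in range(A)]
    r_cnt = [[sum(n[a][b][u][v] for a in range(A) for u in range(A))
              for v in range(B)] for b in range(B)]
    q = [[q_cnt[a][u] / sum(q_cnt[a]) for u in range(A)] for a in range(A)]
    r = [[r_cnt[b][v] / sum(r_cnt[b]) for v in range(B)] for b in range(B)]

    stat = 0.0
    for a in range(A):
        for b in range(B):
            total = sum(n[a][b][u][v] for u in range(A) for v in range(B))
            for u in range(A):
                for v in range(B):
                    c = n[a][b][u][v]
                    if c > 0:
                        p_hat = c / total      # unrestricted estimate
                        stat += 2.0 * c * math.log(p_hat / (q[a][u] * r[b][v]))
    df = A * B * (A * B - 1) - A * (A - 1) - B * (B - 1)
    return stat, df


# Hypothetical 2 x 2 classification in which changes are exactly independent.
balanced = [[[[2, 2], [2, 2]] for _ in range(2)] for _ in range(2)]
stat_ind, df_ind = independence_test(balanced)
```

With $A = B = 2$ the degrees of freedom are 8, in agreement with $(A-1)(B-1)(AB + A + B)$.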


4. A modified model. In the preceding sections, we assumed that the $n_i(0)$ were nonrandom. An alternative is that the $n_i(0)$ are distributed multinomially with probability $\eta_i$ and sample size $n$. Then the distribution of the set $n_{ij}(t)$ is (2.5) multiplied by the marginal distribution of the set $n_i(0)$, which is

(4.1)  $\dfrac{n!}{\prod_{i=1}^{m} n_i(0)!} \prod_{i=1}^{m} \eta_i^{\,n_i(0)}.$

In this model, the maximum likelihood estimate of $p_{ij}$ is again (2.8), and the maximum likelihood estimate of $\eta_i$ is

(4.2)  $\hat\eta_i = n_i(0)/n.$

The means, variances, and covariances of $n_{ij}(t) - n_i(t - 1)p_{ij}$ are found by taking the expected values of (2.20) to (2.23); the same formulas apply with $n_k(0)$ replaced by $n\eta_k$. Also, $n_{ij}(t) - n_i(t - 1)p_{ij}$ are uncorrelated with $n_k(0)$. Since $n_k(0)/n$ estimates $\eta_k$ consistently, the asymptotic variances and covariances of $n^{1/2}(\hat p_{ij} - p_{ij})$ are as in Section 2.4. It follows from these facts that the asymptotic theory of the tests given in Section 3 holds for this modified model.

The asymptotic variances and covariances simplify somewhat if the chain starts from a stationary state; that is, if

(4.3)  $\sum_{i=1}^{m} \eta_i\, p_{ij} = \eta_j.$

For then $\sum_k \eta_k\, p_{ki}^{[t-1]} = \eta_i$ and $\phi_i = T\eta_i$. If it is known that the chain starts from a stationary state, equations (4.3) should be of some additional use in the estimation of $p_{ij}$ when knowledge of the $\eta_i$, or even estimates of the $\eta_i$, are available. We have dealt in this paper with the more general case where it is not known whether (4.3) holds, and have used the maximum likelihood estimates for this case. The estimates obtained for the more general case are not efficient in the special case of a chain in a stationary state because relevant information is ignored. In the special case, the maximum likelihood estimates for the $\eta_i$ and $p_{ij}$ are obtained by maximizing $\log L = \sum n_{ij} \log p_{ij} + \sum n_i(0) \log \eta_i$ subject to the restrictions $\sum_j p_{ij} = 1$, $\sum_i \eta_i p_{ij} = \eta_j$, $\sum_i \eta_i = 1$, $p_{ij} \ge 0$, $\eta_i \ge 0$. In the case of a chain in a stationary state where the $\eta_i$ are known, the maximum likelihood estimates for the $p_{ij}$ are obtained by maximizing $\sum n_{ij} \log p_{ij}$ subject to the restrictions $\sum_j p_{ij} = 1$, $\sum_i \eta_i p_{ij} = \eta_j$, $p_{ij} \ge 0$. Lagrange multipliers can be used to obtain the equations for the maximum likelihood estimates.

5. One observation on a chain of great length. In the previous sections, asymptotic results were presented for $n_i(0) \to \infty$, and hence $\sum n_i(0) = n \to \infty$, while $T$ was fixed. The case of one observed sequence of states ($n = 1$) has been studied by Bartlett [3] and Hoel [10], and they consider the asymptotic theory when the number of times of observation increases ($T \to \infty$). Bartlett has shown that the number $n_{ij}$ of times that the observed sequence was in state $i$ at time $t - 1$ and in state $j$ at time $t$, for $t = 1, \ldots, T$, is asymptotically normally distributed in the 'positively regular' situation (see [3], p. 90). He also has shown ([3], p. 93) that the maximum likelihood estimates $\hat p_{ij} = n_{ij}/n_i^*$ ($n_i^* = \sum_j n_{ij}$) have asymptotic variances and covariances given by the usual multinomial formulas appropriate to $E\,n_i^*$ independent observations ($i = 1, 2, \ldots, m$) from multinomial probabilities $p_{ij}$ ($j = 1, 2, \ldots, m$), and that the asymptotic covariances for two different values of $i$ are 0. An argument like that of Section 2.4 shows that the variables $(n_i^*)^{1/2}(\hat p_{ij} - p_{ij})$ have a limiting normal distribution with means 0 and the variances and covariances given in Section 2.4. This result was proved in a different way by L. A. Gardner [8].

Thus we see that the asymptotic theory for $T \to \infty$ and $n = 1$ is essentially the same as for $T$ fixed and $n_i(0) \to \infty$. Hence, the same test procedures are valid except for such tests as on possibly nonstationary chains. For example, Hoel's likelihood ratio criterion [10] to test the null hypothesis that the order of the chain is $r - 1$ against the alternate hypothesis that it is $r$ is parallel to the likelihood ratio criterion for this test given in Section 3.3. The $\chi^2$-test for this hypothesis, and the generalizations of the tests to the case where the null hypothesis is that the process is of order $u$ and the alternate hypothesis is that the process is of order $r$ ($u < r$), which are presented in Section 3.3, are also applicable for large $T$. Also, the $\chi^2$-test presented in Section 3.1 can be generalized to provide an alternative to Bartlett's likelihood ratio criterion [3] for testing the null hypothesis that $p_{ij\cdots kl} = p_{ij\cdots kl}^0$ (specified).

6. $\chi^2$-tests and likelihood ratio criteria. The $\chi^2$-tests presented in this paper are asymptotically equivalent, in a certain sense, to the corresponding likelihood ratio tests, as will be proved in this section. This fact does not seem to follow from the general theory of $\chi^2$-tests; the $\chi^2$-tests presented herein are different from those $\chi^2$-tests that can be obtained directly by considering the number of individuals in each of the $m^{T+1}$ possible mutually exclusive sequences (see Section 2.1) as the multinomial variables of interest. The $\chi^2$-tests based on $m^{T+1}$ categories need not consider the data as having been obtained from a Markov chain and the alternate hypothesis may be extremely general, while the $\chi^2$-tests presented herein are based on a Markov chain model.

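For small samples, not enough data have been accumulated to decide which tests are to be preferred (see comments in [5]). The relative rate of approach to the asymptotic distributions and the relative power of the tests for small samples are not known. An advantage of the $\chi^2$-tests, which are of the form used in contingency tables, is that, for many users of these methods, their motivation and application seem simpler.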

We shall now prove that the likelihood ratio and the $\chi^2$-tests (tests of homogeneity) presented in Section 3.2 are asymptotically equivalent in a certain sense. First, we shall show that the $\chi^2$-statistic has an asymptotic $\chi^2$-distribution under the null hypothesis. The method of proof can be used whenever the relevant estimates have a limiting normal distribution. In particular, this will be true for statistics of the form $\chi_i^2$ (see (3.6)). In order to prove that statistics of the form $\lambda_i$ (see (3.7)), which are formally similar to the likelihood ratio criterion but are not actually likelihood ratios, have the appropriate asymptotic distribution, we shall then show that $-2 \log \lambda_i$ is asymptotically equivalent to the $\chi_i^2$-statistic, and therefore has a limiting $\chi^2$-distribution under the null hypothesis. Then we shall discuss the question of the equivalence of the tests under the alternate hypothesis. The method of proof presented here can be applied to the appropriate statistics given in the other sections herein, and also where $T \to \infty$ as well as where $n \to \infty$.

Let us consider the distribution of the $\chi^2$-statistic (3.8) under the null hypothesis. From Section 2.4, we see that $n^{1/2}(\hat p_{ij}(t) - p_{ij})$ are asymptotically normally distributed with means 0 and variances $p_{ij}(1 - p_{ij})/m_i(t - 1)$, etc., where $m_i(t) = E\,n_i(t)/n$. For different $t$ or different $i$, they are asymptotically independent. Then the $[n\,m_i(t - 1)]^{1/2}[\hat p_{ij}(t) - p_{ij}]$ have asymptotically variances $p_{ij}(1 - p_{ij})$, etc. Then by the usual $\chi^2$-theory, $\sum n\,m_i(t - 1)[\hat p_{ij}(t) - \hat p_{ij}]^2/p_{ij}$ has an asymptotic $\chi^2$-distribution under the null hypothesis. But

(6.1)  $\operatorname{plim}\,(\hat p_{ij} - p_{ij}) = 0$

because

(6.2)  $\operatorname{plim}\,\big(m_i(t) - n_i(t)/n\big) = 0.$

From the convergence in probability of $(\hat p_{ij} - p_{ij})$ and $(m_i(t) - n_i(t)/n)$, and the fact that $n^{1/2}(\hat p_{ij}(t) - \hat p_{ij})$ has a limiting distribution, it follows that

(6.3)  $\operatorname{plim} \sum \left[ \dfrac{n\,m_i(t-1)\,(\hat p_{ij}(t) - \hat p_{ij})^2}{p_{ij}} - \dfrac{n_i(t-1)\,(\hat p_{ij}(t) - \hat p_{ij})^2}{\hat p_{ij}} \right] = 0.$

Hence, the $\chi^2$-statistic has the same asymptotic distribution as $\sum n\,m_i(t - 1)[\hat p_{ij}(t) - \hat p_{ij}]^2/p_{ij}$; that is, a $\chi^2$-distribution. This proof also indicates that the $\chi_i^2$-statistics (3.6) also have a limiting $\chi^2$-distribution. We shall now show that $-2 \log \lambda_i$ (see (3.7)) is asymptotically equivalent to $\chi_i^2$ under the null hypothesis, and hence will also have a limiting $\chi^2$-distribution.

We first note that for $|x| < \tfrac{1}{2}$

(6.4)  $(1 + x)\log(1 + x) = (1 + x)(x - x^2/2 + x^3/3 - x^4/4 + \cdots) = x + x^2/2 - x^3/6 + x^4/12 - \cdots,$

and

(6.5)  $|(1 + x)\log(1 + x) - x - x^2/2| \le |x|^3$

(see p. 217 in [6]). We see also that

(6.6)  $-2 \log \lambda_i = -2 \sum_{j,t} n_{ij}(t) \log [\hat p_{ij} / \hat p_{ij}(t)] = 2 \sum_{j,t} n_i(t - 1)\,\hat p_{ij}(t) \log [\hat p_{ij}(t) / \hat p_{ij}] = 2 \sum_{j,t} n_i(t - 1)\,\hat p_{ij}\,[1 + x_{ij}(t)] \log [1 + x_{ij}(t)],$

where $x_{ij}(t) = [\hat p_{ij}(t) - \hat p_{ij}]/\hat p_{ij}$. The difference $\Delta$ between $-2 \log \lambda_i$ and the $\chi_i^2$-statistic is

(6.7)  $\Delta = -2 \log \lambda_i - \chi_i^2 = 2 \sum_{j,t} n_i(t - 1)\,\hat p_{ij}\,\{[1 + x_{ij}(t)] \log [1 + x_{ij}(t)] - [x_{ij}(t)]^2/2\}.$

Since $\sum_{j,t} n_i(t - 1)\,\hat p_{ij}\,x_{ij}(t) = 0$,

(6.8)  $\Delta = 2 \sum_{j,t} n_i(t - 1)\,\hat p_{ij}\,\{[1 + x_{ij}(t)] \log [1 + x_{ij}(t)] - x_{ij}(t) - [x_{ij}(t)]^2/2\}.$

We shall show that $\Delta$ converges to 0 in probability; i.e., for any $\epsilon > 0$, the probability of the relation $|\Delta| < \epsilon$ under the null hypothesis tends to unity as $n = \sum_i n_i(0) \to \infty$. The probability satisfies the relation

(6.9)  $\Pr\{|\Delta| < \epsilon\} \ge \Pr\{|\Delta| < \epsilon \text{ and } |x_{ij}(t)| < \tfrac{1}{2}\}$
       $\ge \Pr\{2 \sum_{j,t} n_i(t - 1)\,\hat p_{ij}\,|x_{ij}(t)|^3 < \epsilon \text{ and } |x_{ij}(t)| < \tfrac{1}{2}\}$
       $\ge \Pr\{2n \sum_{j,t} |x_{ij}(t)|^3 < \epsilon \text{ and } |x_{ij}(t)| < \tfrac{1}{2}\}.$

It is therefore necessary only to prove that $n[x_{ij}(t)]^3$ converges to 0 in probability. Since $x_{ij}(t) = [\hat p_{ij}(t) - \hat p_{ij}]/\hat p_{ij}$ converges to zero in probability under the null hypothesis, and

(6.10)  $[m_i(t-1)\,n]^{1/2}\, x_{ij}(t) = [m_i(t-1)\,n]^{1/2}\, [\hat p_{ij}(t) - \hat p_{ij}]/\hat p_{ij}$

has a limiting distribution, it follows that

(6.11)  $n[x_{ij}(t)]^3 = \{[m_i(t-1)\,n]^{1/2}\, x_{ij}(t)\}^3 \big/ \{[m_i(t-1)]^{3/2}\, n^{1/2}\}$

converges to zero in probability when the null hypothesis is true. Q.E.D.
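
The proof rests on the elementary bound $|(1 + x)\log(1 + x) - x - x^2/2| \le |x|^3$ for $|x| < \tfrac{1}{2}$. A quick numerical check on a grid, offered only as a sanity check and not as a proof, is sketched below; the grid spacing and the helper name `remainder` are arbitrary choices for the example.

```python
import math


def remainder(x):
    """(1 + x) log(1 + x) minus its second-order expansion x + x^2 / 2."""
    return (1.0 + x) * math.log(1.0 + x) - x - x * x / 2.0


# Grid of points strictly inside (-1/2, 1/2).
grid = [k / 1000.0 for k in range(-499, 500)]
bound_holds = all(abs(remainder(x)) <= abs(x) ** 3 for x in grid)

# The leading term of the remainder is -x^3 / 6, so the bound is far from tight.
worst_ratio = max(abs(remainder(x)) / abs(x) ** 3 for x in grid if x != 0.0)
```

On this grid the ratio of the remainder to $|x|^3$ stays well below 1, which is consistent with the leading remainder term $-x^3/6$ in the series expansion.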
Since the $\chi^2$-statistic has a limiting $\chi^2$-distribution under the null hypothesis, the asymptotic equivalence of $-2 \log \lambda$ and $\chi^2$ shows that $-2 \log \lambda$ has the same limiting distribution under the null hypothesis.

The preceding argument concerned the case where the null hypothesis is true. Now suppose the alternate hypothesis is true; that is, $p_{ij}(t) \ne p_{ij}(s)$ for some $t, s, i, j$. It is easy to see that both the $\chi^2$-test and the likelihood ratio test are consistent under any alternate hypothesis. That is, if the alternate hypothesis and the significance level are kept fixed, then as $n$ increases, the power of each test tends to 1.

In order to examine the situation in which the power is not close to 1 in large samples and also to make comparisons between tests, the alternate hypothesis may be moved closer to the null hypothesis as $n$ increases. If the $p_{ij}(t)$ for the alternate hypothesis are not fixed but move closer to the null hypothesis, it can be seen that the two tests are again asymptotically equivalent. This can be deduced by a slight modification of the proof of asymptotic equivalence under the null hypothesis given in this section (see also [5], p. 323).

We suggest another approach to the comparison of these tests when the alternate hypothesis is kept fixed. Since the null hypothesis is rejected when an appropriate statistic ($\chi^2$ or $-2 \log \lambda$) exceeds a specified critical value, we might decide that the $\chi^2$-test is to be preferred to the likelihood ratio test if the statistic $\chi^2$ is in some sense (stochastically) larger than $-2 \log \lambda$ under the alternate hypothesis.

Since $n_i(t)$ is a linear combination of multinomial variables, we see that $n_i(t)/n$ converges in probability to its expected value $E(n_i(t)/n) = m_i(t)$. Hence, $\chi^2/n$ converges in probability to

(6.12)  $\sum_{i,j,t} m_i(t - 1)\,[p_{ij}(t) - \bar p_{ij}]^2 / \bar p_{ij},$

and $(-2 \log \lambda)/n$ converges in probability to

(6.13)  $2 \sum_{i,j,t} m_i(t - 1)\,p_{ij}(t) \log [p_{ij}(t) / \bar p_{ij}],$

where

(6.14)  $\bar p_{ij} = \sum_{t} p_{ij}(t)\,m_i(t - 1) \Big/ \sum_{t} m_i(t - 1) = \operatorname{plim}\,\hat p_{ij}.$

The difference between (6.12) and (6.13) is approximately

(6.15)  $\sum_{i,j,t} m_i(t - 1)\,[p_{ij}(t) - \bar p_{ij}]^3 / (3\bar p_{ij}^2).$
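
For a fixed alternative, the two stochastic limits $\sum m_i(t-1)[p_{ij}(t) - \bar p_{ij}]^2/\bar p_{ij}$ and $2\sum m_i(t-1)p_{ij}(t)\log[p_{ij}(t)/\bar p_{ij}]$, together with the cubic approximation to their difference, are straightforward to evaluate. The sketch below assumes the inputs `m_prev[t][i]` (the limiting proportion $m_i(t-1)$) and `p[t][i][j]` (the transition probabilities in the interval ending at $t$, all positive) are supplied by the user; these names, the function `stochastic_limits`, and the example values are illustrative only.

```python
import math


def stochastic_limits(m_prev, p):
    """Limits of chi^2/n and (-2 log lambda)/n, and the cubic approximation
    to their difference, under a fixed alternative.

    m_prev[t][i] -- hypothetical input: m_i(t-1) for interval t
    p[t][i][j]   -- hypothetical input: p_ij(t), assumed strictly positive
    """
    T, m = len(p), len(p[0])
    chi2_lim = lr_lim = cubic = 0.0
    for i in range(m):
        weight = sum(m_prev[t][i] for t in range(T))
        for j in range(m):
            # Weighted average of p_ij(t): the probability limit of the pooled estimate.
            p_bar = sum(p[t][i][j] * m_prev[t][i] for t in range(T)) / weight
            for t in range(T):
                d = p[t][i][j] - p_bar
                chi2_lim += m_prev[t][i] * d * d / p_bar
                lr_lim += 2.0 * m_prev[t][i] * p[t][i][j] * math.log(p[t][i][j] / p_bar)
                cubic += m_prev[t][i] * d ** 3 / (3.0 * p_bar ** 2)
    return chi2_lim, lr_lim, cubic


# Hypothetical two-state, two-interval alternative with a mild change in row 0.
m_prev = [[0.5, 0.5], [0.5, 0.5]]
p_alt = [[[0.5, 0.5], [0.5, 0.5]],
         [[0.6, 0.4], [0.5, 0.5]]]
chi2_lim, lr_lim, cubic = stochastic_limits(m_prev, p_alt)
```

When the departures $p_{ij}(t) - \bar p_{ij}$ are small relative to $\bar p_{ij}$, the two limits are nearly equal and the cubic term accounts for most of the gap; for larger departures the limits can differ appreciably.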

Under the alternate hypothesis, these two stochastic limits differ from 0, and computation of them suggests which test is better. If $(p_{ij}(t) - \bar p_{ij})/\bar p_{ij}$ is small, then there will be only a small difference between the two limits. When the alternative is some composite hypothesis, as is usually the case when $\chi^2$-tests are applied, then these stochastic limits can be computed and compared for the simple alternatives that are included in the alternate hypothesis.

This method for comparing tests is somewhat related to Cochran's comment (see p. 323 in [5]) that either (a) the significance probability can be made to decrease as $n$ increases, thus reducing the chance of an error of type I, or (b) the alternate hypothesis can be moved steadily closer to the null hypothesis. Method (b) was discussed in [3]. If method (a) is used, then the critical value of the statistic ($\chi^2$ or $-2 \log \lambda$) will increase as $n$ increases. When the critical value has the form $cn$, where $c$ is a constant (there may be some question as to whether this form for the critical value is really suitable), we see from the remarks in the preceding paragraph that the power of a test will tend to 1 if $c$ is less than the stochastic limit and it will tend to 0 if $c$ is greater than the stochastic limit. Hence, by this approach we find that the power of the $\chi^2$-test can be quite different from the power of the likelihood ratio test, and some approximate computations can suggest which test is to be preferred.

However, a more appealing approach is to vary the significance level so the ratio of significance level to the probability of some particular Type II error approaches a limit (or at least it seems that desirable sequences of significance points lie between $c$ and $cn$). While the usual asymptotic theory does not give enough information to handle this problem, the comparison of stochastic limits may suggest a comparison of powers.

The methods of comparison discussed herein can also be used in the study of the $\chi^2$ and likelihood ratio methods for ordinary contingency tables. We have seen that, in a certain sense, the $\chi^2$ and likelihood ratio methods are not equivalent when the alternate hypothesis is true and fixed, and we have suggested a method for determining which test is to be preferred.

A STOCHASTIC MODEL FOR INDIVIDUAL
CHOICE BEHAVIOR ¹


R. J. AUDLEY


University College, London


This paper presents a stochastic model which is concerned with the interrelations of the response variables observed in choice situations. The model is not a complete theory, because it involves no assumptions about the relations between stimulus and response variables. However, for given stimulus conditions, the parameters of the stochastic process do provide a convenient summary of many aspects of behaviour in a choice situation. Furthermore, the most elementary assumptions about the way in which these parameters might vary with changed stimulus conditions lead to predictions which are in qualitative agreement with experimental findings. In a sense, therefore, the stochastic model can be regarded as a rudimentary theory of certain aspects of choice behaviour.


Descriptors of Choice Behavior


A wide variety of experiments require the use of a situation involving a choice between two or more alternatives. There are several variables which may be employed in a descriptive summary of the behavior which appears in these situations. These variables can be of two kinds. Firstly, there are descriptors of the primary response to the situation, and, secondly, there are descriptors of the responses which the S makes to his primary choices. Those of the first kind are most commonly used and the three principal ones are: (a) Response time: the time taken for a definite choice to be made. (b) Relative response frequency: the proportion of occasions on which a particular choice response is made. (c) The number of vicarious trial and error responses (VTEs): the number of vacillations between the various alternatives before a definite choice occurs. In the second group, where the descriptor is usually a verbal statement by the S, there are such variables as: (a) confidence in the correctness of a given choice and (b) an assessment of the subjective difficulty of the choice task.

Clearly, the extent to which these various descriptors can be employed will depend upon the specific details of an experiment. But, for many choice situations, all three descriptors of the first kind can be employed. Also in most studies with human Ss the second kind are also available. In fact, this paper will be mainly concerned with the first kind of descriptor, but some suggestions will be advanced which permit those of the second kind to be also included in a unitary stochastic description of choice behavior.

¹ The writer is grateful to A. R. Jonckheere for his generous criticisms during the preparation of the manuscript. He and G. C. Drew were also kind enough to comment upon an earlier draft.

This article appeared in Psychol. Rev., 1960, 67, 1-15. Reprinted with permission.


Particular Choice Situations Which Are 
Considered 


It is believed that the underlying 
hypotheses upon which the stochastic 
description is based are applicable to 
most choice situations. However, the 
derivation of a mathematical model 
from these hypotheses which can be 
readily applied to experimental data 
without additional assumptions is 
more conveniently achieved for a certain class of situations. This class
consists of experiments where knowledge of the outcome or correctness of
a response is not available to the S until after the choice has been
made. Thus, for example, most ordinary disjunctive reaction time studies
are not considered because the S in these experiments can match his
response with a known requirement. Nevertheless, the class of situations
which can be considered is not a trivial one. It includes among others
(a) Discrimination experiments, including most conventional
psychophysical procedures in this category. (b) Studies of preference
and conflict. (c) Investigations of learning in choice situations.

The next section of the paper is
mainly concerned with the events sup-
posed to be taking place during a
single experimental trial.


THE STOCHASTIC MODEL


The notions upon which the model is based are very simple and involve
only two assumptions:

Assumption 1. It is first assumed that, for given stimulus and
organismic conditions, there is associated with each possible choice
response a single parameter. This parameter determines the probability
that in a small interval of time $(t, t + \Delta t)$, there will occur
an “implicit” response of the kind with which the parameter is
associated.

No specific interpretation is given 
to the term “implicit response." It 
may, in certain circumstances, be 
taken to be equivalent to the partial 
response usually classified as a VTE. 
But there are some situations in which 
VTEs are not observed and would 
seem unlikely to be present. In these
cases the “implicit” response may be
regarded as a tendency to make a given
response, or might perhaps be given 
some physiological interpretation. 

The probabilities of the various kinds of “implicit” responses occurring
are considered to be independent of one another, so that for given
conditions, implicit responses of each kind are appearing at random
intervals unaffected by the appearance of other implicit responses. It
follows from the first Assumption that the distribution of the intervals
between successive implicit responses of a given kind is exponential and
is determined entirely by the response parameter (e.g., see Feller,
1950, p. 220).

Assumption 2. It is assumed that a final choice response is made when a
run of K implicit responses of a given kind appears, this run being
uninterrupted by occurrences of implicit responses of other kinds. K may
either be assumed to take a particular value or can be regarded as a
further parameter, which can be estimated from experimental data.
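
The two assumptions can be sketched as a small Monte Carlo simulation. This is an illustrative sketch rather than anything taken from the paper: the function name `simulate_trial`, the example rates 2.0 and 1.0, and the seed are assumptions chosen for the demonstration. It relies on the standard property that independent Poisson streams merge into a single stream whose rate is the sum of the individual rates, with each event belonging to a given stream in proportion to that stream's rate.

```python
import random


def simulate_trial(rates, k=2, rng=None):
    """Simulate one trial of the run-of-K model.

    rates: one rate parameter per response (e.g. [alpha, beta]).
    Returns (index of the final choice, response time, implicit responses).
    """
    rng = rng or random.Random()
    total = sum(rates)
    kinds = range(len(rates))
    t, last, run, seq = 0.0, None, 0, []
    while True:
        # Assumption 1: waiting time to the next implicit response of any
        # kind is exponential with rate `total`; its kind is drawn in
        # proportion to the individual rates.
        t += rng.expovariate(total)
        kind = rng.choices(kinds, weights=rates)[0]
        seq.append(kind)
        run = run + 1 if kind == last else 1
        last = kind
        # Assumption 2: K uninterrupted implicit responses of one kind
        # produce the final choice.
        if run == k:
            return kind, t, seq


rng = random.Random(1)  # fixed seed, illustrative only
trials = [simulate_trial([2.0, 1.0], k=2, rng=rng) for _ in range(20000)]
share_a = sum(choice == 0 for choice, _, _ in trials) / len(trials)
mean_time = sum(t for _, t, _ in trials) / len(trials)
print(round(share_a, 3), round(mean_time, 3))
```

The estimated share of A choices and the mean response time can be compared with the closed-form expressions derived for K = 2 later in the paper.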

Assumption 1 has been employed before. Mueller (1950) has used this
approach to describe the intervals between bar-presses in an operant
conditioning experiment where only one response is involved. For the
same situation, Estes (1950) and Bush & Mosteller (1951) have used an
assumption which is very similar, the only difference being that their
models used a discontinuous rather than a
continuous distribution of responses 
in time. Christie (1952) in discussing 
the determination of response prob- 
abilities in a discrimination experi- 
ment, has used the same assumption 
for situations where two responses are 
competing. Finally, the author of the 
present paper (Audley: 1957, 1958) 
has previously used the same notions 
to combine response times and re- 
sponse probabilities in a stochastic de- 
scription of individual learning be- 
havior. However, in all these ex- 
amples, it has been assumed that 
K = 1. Bush and Mosteller (1955), 
in an analysis of response times in a 
runway situation, have considered a 
continuous model with K > 1, but 
this generalization does not appear to 
have been previously employed in a 
situation involving choice. 

There are several reasons which can be advanced for assuming that K > 1.
Firstly, when K = 1, but not if K > 1, the distributions of response
times for all alternatives can be shown to be identically the same, and
are exponential (e.g., see Audley, 1958). Neither of these properties is
in agreement with experimental findings. Secondly, when K > 1, the
sequence of “implicit” responses occurring before a final choice is made
offers a possible means of including VTEs within the description of
choice behavior. Thirdly, classification of the various sequences of
“implicit” choice suggests an approach to descriptors of the second
kind. For example, “perfect confidence” in a choice might be identified
with sequences consisting of “implicit” responses of one kind only.


Derivation of the Stochastic Model


No further assumptions are required
in the derivation of the model, which
can be applied to situations involving
any number, m, of choices. However,
in order to keep the exposition as brief
as possible, consideration in this paper 
will be limited to situations involving 
a choice between only two alterna- 
tives, i.e., m = 2. Furthermore, the 
mathematical problem is relatively 
simple when K = 2, so that only this 
special case will be presented. Re- 
sults for the more general case have 
been derived and will be elaborated 
elsewhere. 

The two-choice situation with K = 2. The two possible responses will be
called A and B, and implicit responses of the two kinds will be labelled
a and b respectively. Let the parameters associated with the two
responses be $\alpha$ and $\beta$. Assumption 1 means that $p(a)$, the
probability of an a occurring in a small time interval
$(t, t + \Delta t)$, is given by:

$$p(a) = \alpha\,\Delta t \qquad [1a]$$

Similarly

$$p(b) = \beta\,\Delta t \qquad [1b]$$

The probability $p(a \text{ or } b)$, of an implicit response of either
kind but not both, occurring in the small time interval is

$$p(a \text{ or } b) = p(a) + p(b) - 2p(a)p(b) = (\alpha + \beta)\,\Delta t - 2\alpha\beta\,(\Delta t)^2$$

Hence

$$p(a \text{ or } b) = (\alpha + \beta)\,\Delta t \qquad [1c]$$

if terms of order $(\Delta t)^2$ are ignored.
This becomes possible if a transition is made to the continuous case
when the distribution in time of implicit responses follows that of a
Poisson process (e.g., see Feller, 1950, p. 220). Therefore the
probability, $p(n, t)$, of obtaining n implicit responses in the time
interval $(0, t)$ is (e.g., again see Feller, 1950, p. 221):

$$p(n, t) = \frac{(\alpha + \beta)^n t^n e^{-(\alpha+\beta)t}}{n!} \qquad [2]$$

In particular the probability, $p(o, t)$, of obtaining no implicit
response of either kind in time t is given by:

$$p(o, t) = e^{-(\alpha+\beta)t} \qquad [3]$$

The probability, $P_a$, that the first implicit response to occur is an
a is

$$P_a = \int_{t=0}^{\infty} p(o, t)\,\alpha\,dt = \frac{\alpha}{\alpha+\beta} = p, \text{ say} \qquad [4a]$$

Similarly, for implicit b responses

$$P_b = \frac{\beta}{\alpha+\beta} = 1 - p = q \qquad [4b]$$

Since occurrences of implicit responses follow a Poisson process,
Equations 4a and 4b also give the probability that, starting at any
given moment, the next implicit response to occur will be an a or b
respectively. Therefore, ignoring for the moment questions concerning
the time intervals between successive implicit responses, the sequence
of events leading to a final choice can be treated as a sequence of
independent binomial trials, with the probabilities, $P_a$ and $P_b$, of
the two types of event given by Equations 4a and 4b.


The Probability, $P_A$, That the Final
Choice is an A Response


The possible sequences which terminate with the occurrence of an A can
be easily classified when K = 2, for they must consist of alternations
between a and b, until two successive a's occur. The early members of
this set of sequences are: aa, baa, abaa, babaa, etc. The respective
probabilities of these various sequences are clearly: $p^2$, $p^2q$,
$p^3q$, $p^3q^2$, etc. The over-all probability, $P_A$, that the final
choice is an A, is the sum of this infinite series of sequence
probabilities. Thus,

$$P_A = p^2 + p^2q + p^3q + p^3q^2 + \cdots \qquad [5]$$

Whence, simplifying, and substituting for p and q from Equations 4a and
4b

$$P_A = \frac{\alpha^2[\alpha + 2\beta]}{[\alpha+\beta][(\alpha+\beta)^2 - \alpha\beta]} \qquad [6a]$$

Similarly

$$P_B = \frac{\beta^2[2\alpha + \beta]}{[\alpha+\beta][(\alpha+\beta)^2 - \alpha\beta]} \qquad [6b]$$

Equation 6a may be written in the following form:

$$P_A = \frac{\alpha}{\alpha+\beta}\cdot\frac{[(\alpha+\beta)^2 - \beta^2]}{[(\alpha+\beta)^2 - \alpha\beta]}$$

so that when $\alpha > \beta$, $P_A > \frac{\alpha}{\alpha+\beta}$ and
$P_B < \frac{\beta}{\alpha+\beta}$.

Thus the difference between the probabilities of the various implicit
responses occurring is accentuated in the expressions for the
probabilities of overt choice responses. The accentuation increases with
K and implies that there is more certainty in the overt choices than in
the underlying processes which determine them. This is believed to be a
property which many organisms exhibit.
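
The accentuation can be checked numerically. The sketch below assumes that Equation 6a reads $P_A = \alpha^2(\alpha+2\beta)/\{(\alpha+\beta)[(\alpha+\beta)^2-\alpha\beta]\}$ and that the series of Equation 5 runs over the sequences aa, baa, abaa, babaa, and so on; the function names and the example values 2.0 and 1.0 are illustrative choices, not part of the paper.

```python
def p_final_a(alpha, beta):
    """Closed form for the probability that the final choice is A (K = 2)."""
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def p_final_a_series(alpha, beta, terms=200):
    """Partial sum of the series: n alternations followed by 'aa'."""
    p = alpha / (alpha + beta)
    q = 1.0 - p
    # n = 0: aa, n = 1: baa, n = 2: abaa, n = 3: babaa, ...
    return sum(p ** (2 + n // 2) * q ** ((n + 1) // 2) for n in range(terms))


alpha, beta = 2.0, 1.0  # illustrative values
p_implicit = alpha / (alpha + beta)
print(p_implicit, p_final_a(alpha, beta), p_final_a_series(alpha, beta))
```

With these values the overt choice probability exceeds the implicit response probability, which is the accentuation described above, and the closed form agrees with the partial sum of the series.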


Vicarious Trial and Error


If we identify alternating appearances of the “implicit” responses, a
and b, with VTEs, the moments of the distribution of VTEs can readily be
obtained. Attention here will be confined to the mean number of VTEs
preceding (a) any choice (b) a particular choice.


The Mean Number of VTEs Preceding
Any Choice, V


There are no VTEs if the sequence of implicit responses is aa or bb.
There is 1 VTE if the sequence is baa or abb.

There are 2 VTEs if the sequence is abaa or babb, and so on.

Dividing the sequences of implicit responses into those with an odd
number and those with an even number of VTEs, the following
probabilities are found (letting $P(V = n)$ be the probability of
obtaining n VTEs):

$$\begin{aligned}
P(V = 0) &= p^2 + q^2 \\
P(V = 2) &= p^3q + pq^3 \\
P(V = 4) &= p^4q^2 + p^2q^4 \\
&\text{etc.} \\
P(V = 1) &= p^2q + pq^2 \\
P(V = 3) &= p^3q^2 + p^2q^3 \\
P(V = 5) &= p^4q^3 + p^3q^4 \\
&\text{etc.}
\end{aligned}$$

Now

$$V = P(V = 1) + 2P(V = 2) + 3P(V = 3) + \cdots$$

and after some algebraic manipulation and again substituting for p and q
from Equations 4a and 4b,

$$V = \frac{3\alpha\beta}{(\alpha+\beta)^2 - \alpha\beta} \qquad [7]$$

If $\gamma = \beta/\alpha$ then Equation 7 may be rewritten as

$$V = \frac{3\gamma}{(1+\gamma)^2 - \gamma}$$

Thus V is dependent only on the ratio of $\beta$ to $\alpha$, and
becomes a maximum when $\gamma = 1$, i.e., $\alpha = \beta$. Therefore
the number of VTEs would be a maximum when $P_A = P_B = \frac{1}{2}$.

The Mean Number of VTEs Preceding
A and B Responses, $V_A$ and $V_B$


Separate consideration of the mean number of VTEs preceding an A and B
choice yields the following results:

$$V_A = \frac{2\alpha\beta}{(\alpha+\beta)^2 - \alpha\beta} + \frac{\beta}{\alpha + 2\beta} \qquad [8a]$$

$$V_B = \frac{2\alpha\beta}{(\alpha+\beta)^2 - \alpha\beta} + \frac{\alpha}{2\alpha + \beta} \qquad [8b]$$

Since $\frac{\beta}{\alpha+2\beta}$ and $\frac{\alpha}{2\alpha+\beta}$
may be rewritten as $\frac{1}{\frac{\alpha}{\beta}+2}$ and
$\frac{1}{\frac{\beta}{\alpha}+2}$ respectively, it can be seen that on
the average there would be fewer VTEs preceding the response which is
dominant at any given moment, i.e., if $P_A > P_B$, $V_A < V_B$.
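
A short numerical check of the VTE results follows. It is a sketch under the assumption that Equations 7, 8a and 8b read $V = 3\alpha\beta/D$, $V_A = 2\alpha\beta/D + \beta/(\alpha+2\beta)$ and $V_B = 2\alpha\beta/D + \alpha/(2\alpha+\beta)$ with $D = (\alpha+\beta)^2 - \alpha\beta$; the function names and parameter values are illustrative. The over-all mean should equal the mixture of the two conditional means weighted by the choice probabilities.

```python
def p_final_a(alpha, beta):
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def mean_vtes(alpha, beta):
    """Mean number of VTEs preceding any choice (K = 2)."""
    return 3 * alpha * beta / ((alpha + beta) ** 2 - alpha * beta)


def mean_vtes_before_a(alpha, beta):
    """Mean number of VTEs preceding an A choice (K = 2)."""
    d = (alpha + beta) ** 2 - alpha * beta
    return 2 * alpha * beta / d + beta / (alpha + 2 * beta)


alpha, beta = 2.0, 1.0  # illustrative values, A dominant
v = mean_vtes(alpha, beta)
v_a = mean_vtes_before_a(alpha, beta)
v_b = mean_vtes_before_a(beta, alpha)  # B case by symmetry of the labels
p_a = p_final_a(alpha, beta)
mixture = p_a * v_a + (1 - p_a) * v_b
print(v, v_a, v_b, mixture)
```

Swapping the two arguments gives the B quantities, since the model treats the two responses symmetrically.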


The Time Distribution of Final Choice


It is possible to determine all the moments of the time distribution of
final responses. Here, however, consideration will be limited to the
mean latency, L, of all responses and the mean latencies for A and B
responses taken separately, $L_A$ and $L_B$ respectively.


The Mean Latencies for A and B Responses, $L_A$ and $L_B$

Let $P(a, t)$ be the probability that, at time t, no two consecutive a's
or b's have appeared, and that the last implicit response was an a. Let
$P(a, t; n)$ be the probability that, at time t, no two consecutive a's
or b's have appeared, and that the last implicit response was an a, and
also that there have been exactly n implicit responses. Thus

$$P(a, t) = \sum_{n=1}^{\infty} P(a, t; n)$$

To determine $P(a, t; n)$, Equation 2 and the method employed to find
$P_A$ are combined.

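Let $P(a; n)$ be the probability that a sequence of n events ends with
an a, no two consecutive a's or b's having occurred. Clearly,

$$P(a; 1) = \frac{\alpha}{\alpha+\beta}, \qquad
P(a; 2) = \frac{\alpha\beta}{(\alpha+\beta)^2}, \qquad
P(a; 3) = \frac{\alpha^2\beta}{(\alpha+\beta)^3}, \text{ etc.}$$

these probabilities being respectively associated with the sequences: a,
ba, aba, etc.

Now $P(a, t; n) = p(n, t) \cdot P(a; n)$, and Equation 2 gives
$p(n, t)$, so that

$$P(a, t; 1) = p(1, t) \cdot P(a; 1)
= \frac{(\alpha+\beta)\,t\,e^{-(\alpha+\beta)t}}{1!} \cdot \frac{\alpha}{\alpha+\beta}
= \frac{\alpha t\,e^{-(\alpha+\beta)t}}{1!}$$

$$P(a, t; 2) = \frac{(\alpha+\beta)^2 t^2 e^{-(\alpha+\beta)t}}{2!} \cdot \frac{\alpha\beta}{(\alpha+\beta)^2}
= \frac{\alpha\beta\,t^2 e^{-(\alpha+\beta)t}}{2!}$$

Similarly

$$P(a, t; 3) = \frac{\alpha^2\beta\,t^3 e^{-(\alpha+\beta)t}}{3!}$$

etc. Hence

$$P(a, t) = \sum_{n=1}^{\infty} P(a, t; n)
= \frac{\alpha t\,e^{-(\alpha+\beta)t}}{1!}
+ \frac{\alpha\beta\,t^2 e^{-(\alpha+\beta)t}}{2!}
+ \frac{\alpha^2\beta\,t^3 e^{-(\alpha+\beta)t}}{3!} + \cdots$$

which, upon simplification, gives

$$P(a, t) = e^{-(\alpha+\beta)t}\left[\left(\frac{e^{t\sqrt{\alpha\beta}} + e^{-t\sqrt{\alpha\beta}}}{2} - 1\right)
+ \sqrt{\frac{\alpha}{\beta}}\left(\frac{e^{t\sqrt{\alpha\beta}} - e^{-t\sqrt{\alpha\beta}}}{2}\right)\right] \qquad [9a]$$

Similarly it may be determined that

$$P(b, t) = e^{-(\alpha+\beta)t}\left[\left(\frac{e^{t\sqrt{\alpha\beta}} + e^{-t\sqrt{\alpha\beta}}}{2} - 1\right)
+ \sqrt{\frac{\beta}{\alpha}}\left(\frac{e^{t\sqrt{\alpha\beta}} - e^{-t\sqrt{\alpha\beta}}}{2}\right)\right] \qquad [9b]$$

Now

$$L_A = \frac{\int_{t=0}^{\infty} t\,P(a, t)\,\alpha\,dt}{\int_{t=0}^{\infty} P(a, t)\,\alpha\,dt}
= \frac{2(\alpha+\beta)}{(\alpha+\beta)^2 - \alpha\beta} + \frac{\beta}{(\alpha+\beta)(\alpha+2\beta)} \qquad [10a]$$

and similarly

$$L_B = \frac{\int_{t=0}^{\infty} t\,P(b, t)\,\beta\,dt}{\int_{t=0}^{\infty} P(b, t)\,\beta\,dt}
= \frac{2(\alpha+\beta)}{(\alpha+\beta)^2 - \alpha\beta} + \frac{\alpha}{(\alpha+\beta)(2\alpha+\beta)} \qquad [10b]$$

By the same kind of argument it may be demonstrated that the mean
latency for all responses, L, is given by

$$L = \frac{2(\alpha+\beta)^2 + \alpha\beta}{[\alpha+\beta][(\alpha+\beta)^2 - \alpha\beta]}
= \frac{2}{\alpha+\beta} + \frac{3\alpha\beta}{[\alpha+\beta][(\alpha+\beta)^2 - \alpha\beta]} \qquad [11]$$

Returning to Equations 10a and 10b it can be seen that
$\frac{\beta}{(\alpha+\beta)(\alpha+2\beta)}$ and
$\frac{\alpha}{(\alpha+\beta)(2\alpha+\beta)}$ may be written as
$\frac{1}{(\alpha+\beta)\left(\frac{\alpha}{\beta}+2\right)}$ and
$\frac{1}{(\alpha+\beta)\left(\frac{\beta}{\alpha}+2\right)}$
respectively. Thus the dominant response will, on the average, have a
shorter choice time than the other, i.e., if $P_A > P_B$, $L_A < L_B$.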
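
The latency expressions can be cross-checked against one another. The sketch below assumes the readings $L_A = 2(\alpha+\beta)/D + \beta/[(\alpha+\beta)(\alpha+2\beta)]$, the corresponding $L_B$, and $L = 2/(\alpha+\beta) + 3\alpha\beta/[(\alpha+\beta)D]$, with $D = (\alpha+\beta)^2-\alpha\beta$. Function names and parameter values are illustrative assumptions. Two identities should hold: L is the mixture of $L_A$ and $L_B$ weighted by $P_A$ and $P_B$, and L equals the mean number of implicit responses (the mean number of VTEs plus two) divided by $(\alpha+\beta)$.

```python
def p_final_a(alpha, beta):
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def latencies(alpha, beta):
    """Return (L_A, L_B, L) for the two-choice model with K = 2."""
    s = alpha + beta
    d = s**2 - alpha * beta
    l_a = 2 * s / d + beta / (s * (alpha + 2 * beta))
    l_b = 2 * s / d + alpha / (s * (2 * alpha + beta))
    l_all = 2 / s + 3 * alpha * beta / (s * d)
    return l_a, l_b, l_all


alpha, beta = 2.0, 1.0  # illustrative values, A dominant
l_a, l_b, l_all = latencies(alpha, beta)
p_a = p_final_a(alpha, beta)
mixture = p_a * l_a + (1 - p_a) * l_b
mean_vtes = 3 * alpha * beta / ((alpha + beta) ** 2 - alpha * beta)
from_vtes = (mean_vtes + 2) / (alpha + beta)
print(l_a, l_b, l_all, mixture, from_vtes)
```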

In order to compare the theoretical response time distribution to
observed data, the probability $P(0, t)$ of no final response having
occurred by time t is also given. This is clearly

$$P(0, t) = p(o, t) + P(a, t) + P(b, t)$$

$p(o, t)$ is given by Equation 3 and $P(a, t)$ and $P(b, t)$ by
Equations 9a and 9b so that, upon some simplification,

$$P(0, t) = e^{-(\alpha+\beta)t}\left[\left(e^{t\sqrt{\alpha\beta}} + e^{-t\sqrt{\alpha\beta}} - 1\right)
+ \frac{\alpha+\beta}{2\sqrt{\alpha\beta}}\left(e^{t\sqrt{\alpha\beta}} - e^{-t\sqrt{\alpha\beta}}\right)\right] \qquad [12]$$
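
Since $P(0, t)$ is the probability that no final response has occurred by time t, its integral over all t should equal the mean latency L. The sketch below checks this numerically. It assumes Equation 12 has the form $e^{-(\alpha+\beta)t}[2\cosh(ct) - 1 + ((\alpha+\beta)/c)\sinh(ct)]$ with $c = \sqrt{\alpha\beta}$, expanded into three decaying exponentials for numerical stability. The function names, the integration limit and the step count are assumptions suited to rates of order one.

```python
import math


def survival(t, alpha, beta):
    """P(0, t): probability of no final response by time t (K = 2)."""
    s = alpha + beta
    c = math.sqrt(alpha * beta)
    return ((1 + s / (2 * c)) * math.exp(-(s - c) * t)
            + (1 - s / (2 * c)) * math.exp(-(s + c) * t)
            - math.exp(-s * t))


def mean_from_survival(alpha, beta, t_max=40.0, steps=40000):
    """Trapezoidal integral of the survival function over [0, t_max]."""
    dt = t_max / steps
    total = 0.5 * (survival(0.0, alpha, beta) + survival(t_max, alpha, beta))
    total += sum(survival(i * dt, alpha, beta) for i in range(1, steps))
    return total * dt


alpha, beta = 2.0, 1.0  # illustrative values
s = alpha + beta
l_closed = 2 / s + 3 * alpha * beta / (s * (s**2 - alpha * beta))
l_numeric = mean_from_survival(alpha, beta)
print(l_closed, l_numeric)
```

For comparison with data, `survival` evaluated on a grid of times gives the theoretical proportion of trials still undecided at each time.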


The Model and Descriptors of the 
Second Kind 


At present, it is only possible to advance some speculations concerning
variables such as “degree of confidence” in the correctness of a given
choice. Nevertheless, it seems worth considering these since there
appears to be a definite relation between the second kind of descriptor
and the more conventional indices of choice behavior. Henmon (1911),
whose paper will be considered in more detail later, showed that choices
regarded by an S with confidence are generally quicker and more accurate
than others. This result was demonstrated in a psychophysical
discrimination situation where a definite correct choice existed.

There seem to be two possible ways in which “confidence” might be
attributed to a particular choice. The first of these involves some
classification of the various sequences of implicit responses preceding
a final choice. For example, sequences which involve no vacillation at
all, such as aa, or bb, might be regarded as “more confident” than
sequences involving a large number of vacillations, such as abababaa. It
will be shown that this kind of “confident” sequence has the properties
required by Henmon's data.

For, suppose A be the correct and B the incorrect choice in a
psychophysical situation, then generally speaking one would expect
$\alpha > \beta$. The probability of the sequence aa would be
$\left[\frac{\alpha}{\alpha+\beta}\right]^2$ and the probability of bb,
$\left[\frac{\beta}{\alpha+\beta}\right]^2$. Hence, the probability,
$P_c$, of being correct for this type of confident “choice,” i.e.,
choosing A, is given by

$$P_c = \frac{\alpha^2}{\alpha^2 + \beta^2} \qquad [13]$$

Comparing this probability with the overall probability of an A
response, $P_A$ given by Equation 6a,

$$P_c - P_A = \frac{\alpha^2}{\alpha^2+\beta^2} - \frac{\alpha^2(\alpha+2\beta)}{[\alpha+\beta][(\alpha+\beta)^2 - \alpha\beta]}
= \frac{\alpha^2\beta^2(\alpha-\beta)}{[\alpha^2+\beta^2][\alpha+\beta][(\alpha+\beta)^2 - \alpha\beta]} \qquad [14]$$

Clearly, Equation 14 is positive when $\alpha > \beta$ and hence
$P_c > P_A$.

Since for these “confident” responses
only two implicit responses occur be- 
fore a final choice, it is clear that their 
mean response time is shorter than the 
over-all average response time. This 
approach consists essentially in equat- 
ing “degree of confidence" with some 
function of the reciprocal of the num- 
ber of VTEs preceding the final choice. 

The second suggested approach to 
judgmental confidence is based upon 
the fact that these appraisals of a re- 
sponse, under normal instructions, fol- 
low after the response itself. Degree 
of confidence, therefore, might be as- 
sociated with implicit responses con- 
tinuing to occur after an overt choice 
response has occurred. If, after an A
response has been made, a further a
occurs in the time before the state-
ment of confidence is produced, this
might be taken to lead to greater con-
fidence than if nothing or a b appeared.
Indeed, it might be possible to develop 
a model for the distribution of the 
times between making the primary 
choice response and giving an esti- 
mate for degree of confidence from 
this kind of assumption. 

Other approaches to the second kind 
of descriptor are undoubtedly possible 
within the present scheme. The im- 
portant point is that it is possible to 
test these various hypotheses quite 
easily. They each predict how often 
a given level of confidence would be 
employed. Also the expected distri- 
bution of descriptors of the first kind 
associated with each level of confi- 
dence can be determined. 


THE AGREEMENT BETWEEN THE
PROPERTIES OF THE MODEL
AND EMPIRICAL DATA


The principal aim of this paper is 
to show that a set of very simple as- 
sumptions can be used to derive rela- 
tions which might be expected among 
the variables observed in a choice 
situation. In an exposition of this 
kind it is not possible to examine, in 
any detail, the success of the model in 
describing the results of experiments 
which are relevant. For one thing, only the particular case arising when
K = 2 has been presented, whereas in practice it may be more profitable
to treat K as a parameter. Also, the argument so far presented is
concerned with the events supposed to occur at a single experimental
trial. The manner in which the model is applied to experimental data
based upon a number of trials will depend very much upon the way in
which separate trials resemble one another. There may be actual
variation in the conditions from trial to trial, or there may be a
direct dependence of later upon earlier trials, as in learning
experiments. For this reason, consideration of quantitative evidence
will be mainly confined to an experiment by Henmon (1911), in which the
conditions under which individual trials were conducted closely resemble
one another and where it can reasonably be assumed that there are no
systematic changes in an S's behavior. These data can therefore be
regarded as appropriate for testing the model without there being any
need to make further special assumptions. However, before examining
Henmon's results, it seems worthwhile to exhibit
the manner in which the model seems 
to match empirical evidence about 
choice behavior in general. 

In effecting a general appraisal of 
the model, one is hindered by the 
general lack of individual results in 
the experimental literature. For rea- 
sons which cannot be examined here 
it seems preferable to test hypotheses 
about functional relations upon indi- 
vidual data. A brief argument for this 
point of view has been presented by Bakan (1955) and for the study of
learning behavior by Audley and Jonckheere (1956). The reader is
referred to these papers for further details. However, irrespective of
the stand taken on this question, it is clear that the present model is
concerned with individual results and that such results are not
generally available. For this reason, the following
comparison of the model with experi- 
mental evidence is largely qualitative, 
although, given appropriate data, 
quantitative comparisons would have 
been possible. 


Psychophysical Discrimination Situa- 
tions 


In considering results from psychophysical experiments, say using the
constant method, it is necessary to consider separately the comparison
of each variable with the standard. This is so because no assumptions
have thus far been made about the relation between stimulus and response
variables. In spite of this, some general predictions can be made.

Consider the results obtained from the comparison of the standard with a
particular variable stimulus. In this comparison, it can be supposed
that the responses A and B refer to the respective statements “the
variable is greater than the standard” and “the variable is smaller than
the standard.” $\alpha$ will clearly be a monotonically increasing
function of the magnitude of the variable, and $\beta$ a monotonically
decreasing function of the same magnitude. At the PSE, $\alpha = \beta$.
Within limits, and certainly for a range of stimuli close to the PSE,
$(\alpha + \beta)$ can be assumed to be approximately constant. This
supposition is not crucial, but simplifies the ensuing argument.


Relation of Judgment Time to the Perceived Distance between Stimuli

Equation 11 gives the mean choice time as a function of $\alpha$ and
$\beta$. This can be rewritten in the following way:

$$L = \frac{2}{\alpha+\beta} + \frac{3}{(\alpha+\beta)\left[\dfrac{(\alpha+\beta)^2}{\alpha\beta} - 1\right]} \qquad [15]$$

If $(\alpha + \beta)$ is approximately constant, L will depend
principally upon the product of the parameters, $\alpha\beta$. Thus L
will have a maximum when $\alpha = \beta$. From Equation 6a it can be
seen that the point, $\alpha = \beta$, also defines the PSE, since for
these parameter values $P_A = P_B = 0.5$. It can be seen that decision
time will therefore rise monotonically up to the PSE and then decrease
monotonically beyond the PSE. For the range and distribution of stimuli
employed in most psychophysical studies, the decrease in decision time
upon either side of the PSE will be, according to the model,
approximately symmetrical. These properties are in agreement with
empirical data, as for example summarized by Guilford (1954).

Even where the S is allowed three 
categories of response, it is the bound- 
aries between these categories which 
show peak decision times (Cartwright, 
1941). This would be expected if a 
further parameter be used to charac- 
terize “equal” or “doubtful” responses. 
It would be of great interest to deter- 
mine whether, in fact, a further re- 
sponse parameter is required when a 
third response category is permitted. 
Almost by definition, the response 
“doubtful” implies that no decision 
has been reached by a certain time. 
Such responses would then appear to 
be best described by the time which 
the S is willing to spend in attempting
to come to a decision. This would 
make the range of stimuli over which 
judgments of “doubtful” are made 
depend only indirectly upon differ- 
ential sensitivity. The readiness of 
the S to continue attempting to arrive 
at a definite answer would also play 
an important role. This is in accord 
with the generally accepted view of 
the use of a third category, e.g., Wood- 
worth (1938), Guilford (1954). On 
the other hand, a parameter to specify 
judgments of “equality” may still be 
required. This would allow for a time 
determined “doubtful” judgment of
the kind discussed above, but would
also introduce a true “equals” cate-
gory. This would enable an analysis
of the third category to be carried out 
in accordance with the suggestions of 
Cartwright (1941) and George (1917). 


The Relation between Confidence, Decision
Time and Perceived Distance
between Stimuli


The exact nature of the relations
between the variables considered in
this section will depend upon whether
stimulus conditions are the same for
all trials. Nevertheless, some general 
predictions can be advanced. 

Here, “degree of confidence” will be equated with some function of the
reciprocal of the number of VTEs preceding a final choice. The number of
VTEs can, of course, range from zero to infinity. Generally speaking,
confidence is rated upon some scale from zero to unity. Let C be the
degree of confidence associated with a given choice, and V the number of
VTEs preceding this choice act. Determining a suitable relation between
C and V would, in fact, be one of the experimental problems suggested by
the present approach. For the moment, however, it will be assumed that,

$$C = \frac{1}{V + 1} \qquad [16]$$

so that when V = 0, C = 1; and when $V = \infty$, C = 0.

It will be recalled from the section concerned with VTEs that the mean
number of these will, when K = 2, be two less than the number of
implicit responses preceding a final choice. Now it can easily be
demonstrated, using Equation 1c, that the mean choice time when n
implicit responses occur, $T_n$, is given by

$$T_n = \frac{n}{\alpha+\beta} \qquad [17]$$

Whence, since V = n − 2, and because n is eliminated from Equation 17,
it is possible to express the mean choice time, T, as a function of V,
given by

$$T = \frac{V + 2}{\alpha+\beta} \qquad [18]$$

Substituting for V from Equation 16 and adding an arbitrary constant,
$T_0$, for the minimum choice time possible:

$$T = \frac{1}{(\alpha+\beta)C} + \frac{1}{\alpha+\beta} + T_0 \qquad [19]$$

This hyperbolic function is in agreement with experimental
determinations of the relation between confidence and judgment time,
e.g., see again Guilford (1954).
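
The chain from VTEs to confidence to choice time can be sketched as follows. The code assumes the readings $C = 1/(V+1)$, $T = (V+2)/(\alpha+\beta)$ and $T = 1/[(\alpha+\beta)C] + 1/(\alpha+\beta) + T_0$; the function names, the rates and the value of `t0` are illustrative assumptions. The two routes to the choice time should agree for every V once the same constant is added to both.

```python
def confidence(v):
    """Assumed relation between confidence C and number of VTEs V."""
    return 1.0 / (v + 1)


def time_from_vtes(v, alpha, beta, t0=0.0):
    """Mean choice time after V VTEs, plus a constant minimum time t0."""
    return (v + 2) / (alpha + beta) + t0


def time_from_confidence(c, alpha, beta, t0=0.0):
    """Hyperbolic relation between mean choice time and confidence."""
    return 1.0 / ((alpha + beta) * c) + 1.0 / (alpha + beta) + t0


alpha, beta, t0 = 2.0, 1.0, 0.2  # illustrative values
for v in range(5):
    c = confidence(v)
    print(v, round(c, 3),
          round(time_from_vtes(v, alpha, beta, t0), 3),
          round(time_from_confidence(c, alpha, beta, t0), 3))
```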

If the stimulus conditions are varied 
between different sets of trials, as for 
example in the constant method dis- 
cussed in the previous section, general 
conclusions are again possible. For 
in discussing Equation 7, it was shown that the mean number of VTEs
depends only upon the ratio of $\alpha$ to $\beta$. Again assuming that
$(\alpha + \beta)$ is approximately constant, V would be a roughly
symmetrical function of the magnitude of the variable, having a maximum
at the PSE. Thus the average degree of confidence, C, would be a roughly
U shaped function having a minimum at the PSE. Since choice time has
been shown to have a maximum at the PSE and to decrease upon either side
of this point, C and T would again vary inversely.
This agrees with experimental data 
(see Guilford, 1954). 


Preference and Conflict Situations 


In this kind of situation, a number 
of objects are paired and the subject 
makes a choice indicating the preferred object of each pair. For any
given pair of objects, say A and B, the parameters $\alpha$ and $\beta$
can be taken to represent some measure of preference for A and B.
Because there are a number of objects, it is more convenient to label
the objects presented to the subject as $X_i$, and to let the parameter
associated with a kind of “absolute preference” for each, be $\alpha_i$
$(i = 1, 2, \ldots, r)$. The $\alpha$ and $\beta$ of the equations will
now be replaced by, say $\alpha_j$ and $\alpha_k$, for the comparison of
the jth and kth objects, $X_j$ and $X_k$. This, of course, is to make
the very strong assumption that the $\alpha_i$'s are independent of the
particular comparison in which they are involved. This
assumption could be readily tested by 
using the model appropriately, and is 
accepted here only in order to simplify 
notation. The results of the following 
argument would be qualitatively the 
same, even if there were in fact, 
contextual effects peculiar to each 
comparison. 

Variation in choice time among different comparisons. The set of r
objects, on the basis of a paired comparison technique, can usually be
ranked. Let i be an individual's ranking of an object, so that we may
write $X_1 > X_2 > X_3 \cdots > X_i > \cdots > X_r$, meaning $X_1$ is
preferred to $X_2$ and so on. This means that
$\alpha_1 > \alpha_2 > \cdots > \alpha_i > \cdots > \alpha_r$. Consider
any pair of parameters, say $\alpha_j$ and $\alpha_k$, and let these be
the $\alpha$ and $\beta$ of the earlier equations. Then the mean choice
time is given by Equation 11, and this can now be rewritten as

$$L_{jk} = \frac{2}{\alpha_j + \alpha_k} + \frac{3\alpha_j\alpha_k}{[\alpha_j + \alpha_k][(\alpha_j + \alpha_k)^2 - \alpha_j\alpha_k]} \qquad [20]$$

Clearly $L_{jk}$ depends upon two things: the sum of the parameters
$(\alpha_j + \alpha_k)$ and, secondly, the product of the parameters,
$\alpha_j\alpha_k$. Other things being equal, the choice time will
decrease as $(\alpha_j + \alpha_k)$ increases. Again, with
$(\alpha_j + \alpha_k)$ constant, $L_{jk}$ will increase with the
product, reaching a maximum when $\alpha_j = \alpha_k$. Choice time will
therefore (a) depend upon the general level of preference for objects,
being quicker for preferred objects, (b) will be quicker the greater the
difference in preference for the two paired objects. This is in
agreement with experimental findings, e.g., for children choosing among
liquids to drink, Barker (1942), for aesthetic preferences, Dashiell
(1937).

It will be interesting to determine
how far the assumption of an absence 
of contextual effects can be main- 
tained. If the assumption turns out 
to be approximately true, then the 
parameters, $\alpha_i$, would provide a means
of scaling the stimulus objects for a 
given individual. In essence, such an 
approach would resemble that adopted 
by Bradley and Terry (1952), but 
would have the added advantage that 
the scale values would have an abso- 
lute rather than a relative basis, so 
that the scale values should be un- 
affected by the inclusion of new 
comparisons. 

Number of VTEs for different comparisons. It was shown, in discussing
Equation 7, that the mean number of VTEs in a given situation depends
entirely upon the ratio of $\alpha$ to $\beta$. Using the present
notation this would be the ratio of $\alpha_j$ to $\alpha_k$, for
objects $X_j$ and $X_k$. The number of VTEs has a maximum when
$\alpha_j = \alpha_k$ and decreases as the values of the parameters
become more disparate. Thus the number of VTEs should depend entirely
upon the differences in preference and not upon the general level of
preference for the two paired objects. Thus for adjacent objects, $X_i$
and $X_{i+1}$, the number of VTEs before a final choice will not rise
with choice time as one proceeds from preferred to nonpreferred objects.
This is slightly complicated by differences in “preference distance”
between adjacent objects, but the prediction is again found to be in
agreement with experimental evidence, e.g., see Barker (1942).
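
The paired-comparison predictions can be illustrated with a set of hypothetical preference parameters. The values in `prefs` below are invented for the demonstration and are spaced at a constant ratio; the function names are likewise assumptions. The code uses the mean choice time of Equation 20 as read here, $2/(\alpha_j+\alpha_k) + 3\alpha_j\alpha_k/\{[\alpha_j+\alpha_k][(\alpha_j+\alpha_k)^2-\alpha_j\alpha_k]\}$, and the mean number of VTEs, $3\alpha_j\alpha_k/[(\alpha_j+\alpha_k)^2-\alpha_j\alpha_k]$.

```python
def mean_latency(a_j, a_k):
    """Mean choice time for a pair of objects (K = 2)."""
    s = a_j + a_k
    return 2.0 / s + 3.0 * a_j * a_k / (s * (s**2 - a_j * a_k))


def mean_vtes(a_j, a_k):
    """Mean number of VTEs for a pair of objects (K = 2)."""
    return 3.0 * a_j * a_k / ((a_j + a_k) ** 2 - a_j * a_k)


# Hypothetical preference parameters, most preferred first, constant ratio.
prefs = [8.0, 4.0, 2.0, 1.0]
adjacent = list(zip(prefs, prefs[1:]))
for hi, lo in adjacent:
    print(hi, lo, round(mean_latency(hi, lo), 3), round(mean_vtes(hi, lo), 3))

# Same sum of parameters, increasing difference in preference.
same_sum = [mean_latency(3.0, 3.0), mean_latency(4.0, 2.0), mean_latency(5.0, 1.0)]
print([round(x, 3) for x in same_sum])
```

Under these assumed values, choice time lengthens from preferred to nonpreferred adjacent pairs while the mean number of VTEs stays the same, and for a fixed sum of parameters the choice time shortens as the two preferences move apart.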

Learning in choice situations. It is 
in considering learning behavior that 
the need for individual results is 
greatest (Audley & Jonckheere, 1956). 
The full advantages of the present 
approach to response variables can 
only be gained by incorporating the 
assumption in a stochastic model for
learning. The way in which this
might be contrived, when K = 1, has 
already been outlined and illustrated 
elsewhere (Audley: 1957, 1958). On 
the whole, therefore, the experimental 
literature does not provide results in 
a way which enable the predictions of 
the model to be falsified, even at a 
qualitative level. The most that can 
be done here is to show that the pre- 
dictions might well be good approxi- 
mations to the properties of learning 
data. 

Given a particular theory of learn- 
ing it would, of course, be possible to 
anchor the theory more closely to re- 
sponse variables by identifying the 
parameter of the choice model with an 
appropriate theoretical construction. 

The properties of the model and simple learning behavior. Consider,
for example, learning in a simple two-choice situation. Let $\alpha$
be associated with A, the correct response, and $\beta$ with B, the
incorrect response. The way in which $\alpha$ and $\beta$ vary with
reward and punishment is naturally a matter for investigation and
would certainly condition the form of the prediction which would be
made. Nevertheless, it is not unreasonable to assume that $\alpha$
will be some monotonic increasing function, and $\beta$ some monotonic
decreasing function, of practice and of punishments and rewards.


Let it be supposed that the S has at first a strong tendency to
produce the incorrect choice, i.e., $\alpha$ is small relative to
$\beta$. Consider, firstly, what might be expected to happen to the
over-all latency L, and the latencies of A and B, $L_A$ and $L_B$
respectively. In discussing Equations 10a and 10b it was shown that
the dominant response, on the average, will have the shorter choice
time. Thus in the first place it will be expected that $L_A$ will be
greater than $L_B$ until the probability of making the correct choice,
$P_A$, reaches and exceeds 0.5, when $L_A$ will be generally shorter
than $L_B$.

All of the latencies are dependent upon two factors, the sum
$(\alpha + \beta)$ and the ratio of $\alpha$ to $\beta$. The over-all
latency, L, if $(\alpha + \beta)$ remains constant, will rise to a
maximum at $P_A = P_B = 0.5$ (i.e., $\alpha = \beta$) and then fall
again. Superimposed upon this rise and fall will be the influence of
$(\alpha + \beta)$, and if the levels of, say, punishment and reward
are such as to disturb the constancy of this quantity, then there will
be an accentuation or flattening of the curve of latency as a function
of practice. The monotonic decline in response latencies observed when
an S is introduced into a learning situation for the first time does
not counter this prediction. For, then, it is to be expected that
$(\alpha + \beta)$ will be initially small and the effect of
increasing $\alpha$, and, hence, $(\alpha + \beta)$, will be
reinforced by the growing difference in magnitude between $\alpha$ and
$\beta$. In original learning, therefore, the two factors work
together and produce the monotonic decrease in latency.

The number of VTEs, from Equation 7, is seen to be a function only of
the ratio of $\alpha$ to $\beta$. Thus VTEs would be expected to rise
to a maximum at $\alpha = \beta$, i.e., $P_A = P_B = 0.5$, and then
decline.

These predictions are probably only applicable to the very simple
two-choice situations so far considered. For discrimination studies,
the problem is complicated by the way in which the relevant cues are
being utilized by the organism and there is no point in reviewing the
controversy over this matter. It does however seem worthwhile pointing
out that, in discrimination behavior, it is very probable that there
appears something like the problem of the use of the third category in
psychophysical procedures. That is, a distinction seems to be
necessary between, on the one hand, a definite act of choice and, on
the other hand, behavior which occurs simply because something has to
be done in the situation. This speculative point is raised because the
size of the parameters may exert an influence upon behavior in two
ways. Firstly, by determining the probability of making a particular
response when a “true” choice is made and, secondly, by determining
the probability that a “true” choice is made.

Henmon’s experiment. The experi- 
ment conducted by Henmon (1911) is 
of particular interest, because it pro- 
vides data from individual Ss, in a 
situation where stimulus conditions 
can be assumed to be fairly constant 
from trial to trial. The observations, 
therefore, are important for any model 
concerned with the properties of 
choice behavior. 

Henmon required Ss, in each of 
1,000 trials, to decide whether one of 
two horizontal lines was longer or 
shorter than the other. The lengths 
of the lines were always 20 mm and 
20.3 mm respectively. In addition, 
Ss were instructed to indicate their 
confidence in each judgment. 

The model is qualitatively in agree- 
ment with Henmon’s data, except in 
two things. Firstly, although aver- 
age choice time for wrong responses is 
larger than that for correct choices, as 
predicted by the model, the wrong 
responses are relatively quicker in 
each category of confidence. The 
second qualitative difference appears 
in examining accuracy as a function 
of time. There is some indication for 
some Ss that although there is a 
general decline in accuracy with longer 
choice times, again predicted by the 
model, there is also a slight rise in 
accuracy in going from very short to 
moderately short choice times. It is
possible that both of these differences
may be accounted for by a suitable 
analysis of judgments of confidence 
about which only a few speculations 
have been advanced in the present 
paper. The important point, it seems 
to the author, is that the general 
stochastic model is capable of dealing 
with this kind of issue, rather than 
that it succeeds in all details at the 
present time. 

Henmon gives the distribution of 
all choice times for each individual. 
Since this can also be derived from 
the model, a comparison of the two 
distributions should give further indi- 
cations as to the adequacy of the 
present approach to choice behavior. 
In testing the goodness of fit of the 
model in this matter, it would be 
usual to estimate the parameters from 
the distribution of choice times alone. 
However, it was decided that perhaps 
a stronger case could be made out if 
the only time datum used to estimate 
the parameters was the mean latency. 
Two equations are of course required
if values of $\alpha$ and $\beta$ are to be determined,
and $P_A$, the probability of a
correct response, was chosen for the
second. Accordingly the present estimates
are based upon Equations 6a
and 11.

There must, of course, be some minimum response time before which no
response can occur. This is not easy to determine from Henmon’s tables
of results, because the data are already grouped in intervals of 200
milliseconds. For this reason, the minimum possible time was estimated
in the following way. For various assumed minimum times, estimates of
$\alpha$ and $\beta$ were determined, and the theoretical distribution
of choice times computed. The value leading to the best fit was then
adopted. This is not entirely a satisfactory procedure, but with K
assumed to be 2, and with no

TABLE 1

Subject Bl

Interval (msec)    Observed frequency    Expected frequency
100-               (2)*
300-               57                    53
500-               214                   229
700-               220                   229
900-               159                   168
1100-              113                   [illegible]
1300-              85                    83
1500-              74                    48
1700-              32                    [illegible]
1900-              18                    20
2100-              11                    10
2300-              8                     8
Above 2500         7                     [illegible]
Total              1000                  1000

Subject Br

Interval (msec)    Observed frequency    Expected frequency
100-299            (2)*
300-               350                   352
500-               381                   398
700-               170                   165
900-               63                    [illegible]
1100-              26                    [illegible]
1300-              5                     [illegible]
Above 1500         [illegible]           [illegible]
Total              1000                  1000

* These observations ignored in calculations.


direct indication of the minimum 
time, it seemed the best available in 
the circumstances. The results for 
Henmon’s (1911, Table 2, p. 194) Ss 
Bl and Br are considered below. 

For Bl, the minimum possible time was taken to be about 0.40 sec. On
this basis $\alpha = 3.19$ and $\beta = 1.28$, these values referring
to a time scale measured in seconds. For Br, the minimum time was
taken to be 0.34 sec., giving $\alpha = 6.68$ and $\beta = 4.28$. A
comparison of the observed and expected distributions of response
times is given in Table 1. The agreement between model and data seems
to be reasonably good.


CONCLUDING REMARKS 

On the whole, there is a certain looseness in the way in which many
contemporary theories and even local hypotheses are linked to observed
response variables. It seems worthwhile, therefore, to try to
determine whether these variables might not be related to one another
by relatively simple laws which operate in most
choice situations. In this way, not 
only are descriptions of choice be- 
havior considerably simplified, but 
better ways of formulating and testing 
theories are suggested. The model 
itself is naturally also a theory about 
a certain aspect of behavior, and as 
such needs to be tested. 

In this presentation of the general 
stochastic model the intention is to 
indicate the potentialities of the ap- 
proach, rather than to make specific 
tests of the case arising when K = 2. 
It is not to be expected that the two 
simple assumptions will alone account 
for the relations existing between re- 
sponse variables in a wide diversity of 
situations. Each situation will undoubtedly have certain unique
conditions which have to be taken into account. But the model does
seem to share certain important properties with choice behavior and
appears to be a reasonable working hypothesis. It can be tested in
great detail against data, and the parameters are of a kind which
could be identified with either psychological or physiological
constructs.

Methods of estimating parameters 
and statistical tests of goodness of fit 
will be discussed elsewhere. For the 
present model, neither of these pro- 
cedures involves any novel problems. 
For example, given the probability of 
occurrence of one of the alternative 
responses and the over-all mean re- 
sponse time, Equations 6 and 11 may 
be easily solved to give the appropriate parameter values.

A MATHEMATICAL MODEL FOR SIMPLE LEARNING


BY ROBERT R. BUSH AND FREDERICK MOSTELLER


Harvard University


Introduction 


Mathematical models for empirical phenomena aid the development of a
science when a sufficient body of quantitative information has been
accumulated. This accumulation can be used to point the direction in
which models should be constructed and to test the adequacy of such
models in their interim states. Models, in turn, frequently are useful
in organizing and interpreting experimental data and in suggesting new
directions for experimental research. Among the branches of
psychology, few are as rich as learning in quantity and variety of
available data necessary for model building. Evidence of this fact is
provided by the numerous attempts to construct quantitative models for
learning phenomena. The most recent contribution is that of Estes (2).


In this paper we shall present the basic structure of a new
mathematical model designed to describe some simple learning
situations. We shall focus attention on acquisition and extinction in
experimental arrangements using straight runways and Skinner boxes,
though we believe the model is more general; we plan to extend the
model in order to describe multiple-choice problems and experiments in
generalization and discrimination in later papers. Wherever possible
we shall discuss the correspondence between our model and the one
being developed by Estes (2), since striking parallels do exist even
though many of the basic premises differ. Our model is discussed and
developed primarily in terms of reinforcement concepts while Estes’
model stems from an attempt to formalize association theory. Both
models, however, may be re-interpreted in terms of other sets of
concepts. This state of affairs is a common feature of most
mathematical models. An example is the particle and wave
interpretations of modern atomic theory.

We are concerned with the type of learning which has been called
“instrumental conditioning” (5), “operant behavior” or “type R
conditioning” (10), and not with “classical conditioning” (5),
“Pavlovian conditioning” or “type S conditioning” (10). We shall
follow Sears (9) in dividing the chain of events as follows: (1)
perception of a stimulus, (2) performance of a response or
instrumental act, (3) occurrence of an environmental event, and (4)
execution of a goal response. Examples of instrumental responses are
the traversing of a runway, pressing of a lever, etc. By environmental
events we mean the presentation of a “reinforcing stimulus” (10) such
as food or water, but we wish to include in this category electric
shocks and other forms of punishment, removal of the animal from the
apparatus, the sounding of a buzzer, etc. Hence any change in the
stimulus situation which follows an instrumental response is called an
environmental event. A goal response, such as eating food or drinking
water, is not necessarily involved in the chain. It is implied,
however, that the organism has a motivation or drive which corresponds
to some goal response. Operationally speaking, we infer a state of
motivation from observing a goal response.

1951, 58, 313-323. Reprinted with permission.


Probabilities and How They Change 


As a measure of behavior, we have 
chosen the probability, p, that the 
instrumental response will occur dur- 
ing a specified time, h. This proba- 
bility will change during conditioning 
and extinction and will be related to 
experimental variables such as latent 
time, rate, and frequency of choices.
The choice of the time interval, h, will
be discussed later. We conceive that
the probability, p, is increased or decreased
a small amount after each
occurrence of the response and that
the determinants of the amount of
change in p are the environmental
events and the work or effort expended
in making the response. In addition,
of course, the magnitude of the change
depends upon the properties of the
organism and upon the value of the
probability before the response occurred.
For example, if the probability
was already unity, it could not
be increased further.


Our task, then, is to describe the change in probability which occurs
after each performance of the response being studied. We wish to
express this change in terms of the probability immediately prior to
the occurrence of the response and so we explicitly assume that the
change is independent of the still earlier values of the probability.
For convenience in describing the step-wise change in probability, we
introduce the concept of a mathematical operator. The notion is
elementary and in no way mysterious: an operator Q when applied to an
operand p yields a new quantity Qp (read Q operating on p). Ordinary
mathematical operations of addition, multiplication, differentiation,
etc., may be defined in terms of operators. For the present purpose,
we are interested in a class of operators Q which when applied to our
probability p will give a new value of probability Qp. As mentioned
above, we are assuming that this new probability, Qp, can be expressed
in terms of the old value, p. Supposing Qp to be a well-behaved
function, we can expand it as a power series in p:

$$Qp = a_0 + a_1 p + a_2 p^2 + \cdots \tag{1}$$

where $a_0, a_1, a_2, \ldots$ are constants independent of p. In order
to simplify the mathematical analysis which follows, we shall retain
only the first two terms in this expansion. Thus, we are assuming that
we can employ operators which represent a linear transformation on p.
If the change is small, one would expect that this assumption would
provide an adequate first approximation. Our operator Q is then
completely defined as soon as we specify the constants $a_0$ and
$a_1$; this is the major problem at hand. For reasons that will soon
be apparent, we choose to let $a_0 = a$ and $a_1 = 1 - a - b$. This
choice of parameters permits us to write our operator in the form

$$Qp = p + a(1 - p) - bp. \tag{2}$$


This is our basic operator and equation (2) will be used as the
cornerstone for our theoretical development. To maintain the
probability between 0 and 1, the parameters a and b must also lie
between 0 and 1. Since a is positive, we see that the term,
$a(1 - p)$, of equation (2) corresponds to an increment in p which is
proportional to the maximum possible increment, $(1 - p)$. Moreover,
since b is positive, the term, $-bp$, corresponds to a decrement in p
which is proportional to the maximum possible decrement, $-p$.
Therefore, we can associate with the parameter a those factors which
always increase the probability and with the parameter b those factors
which always decrease the probability. It is for these reasons that we
rewrote our operator in the form given in equation (2).
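
As an illustration of equation (2), the operator can be written as a single function. The following is a minimal Python sketch; the function name `apply_q` and the numerical values of a and b are illustrative assumptions and are not taken from the paper.

```python
def apply_q(p: float, a: float, b: float) -> float:
    """Apply the linear operator Qp = p + a(1 - p) - bp of equation (2)."""
    if not (0.0 <= p <= 1.0 and 0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
        raise ValueError("p, a and b must all lie between 0 and 1")
    return p + a * (1.0 - p) - b * p


# Hypothetical parameter values, chosen only for illustration.
a, b = 0.2, 0.1
for p in (0.0, 0.5, 1.0):
    print(f"p = {p:.2f} -> Qp = {apply_q(p, a, b):.4f}")
```

At p = 0 the operator returns a, and at p = 1 it returns 1 - b, so with both parameters between 0 and 1 the new probability stays between 0 and 1.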

We associate the event of presenting a reward or other reinforcing
stimulus with the parameter a, and we assume that a = 0 when no reward
is given as in experimental extinction. With the parameter b, we
associate events such as punishment and the work required in making
the response. (See the review by Solomon [11] of the influence of work
on behavior.) In many respects, our term, $a(1 - p)$, corresponds to
an increment in “excitatory potential” in Hull's theory (6) and our
term, $-bp$, corresponds to an increment in Hull's “inhibitory
potential.”

In this paper, we make no further attempt to relate our parameters, a
and b, to experimental variables such as amount of reward, amount of
work, strength of motivation, etc. In comparing our theoretical
results with experimental data, we will choose values of a and b which
give the best fit. In other words, our model at the present time is
concerned only with the form of conditioning and extinction curves,
not with the precise values of parameters for particular conditions
and particular organisms.


Continuous Reinforcement and 
Extinction 


Up to this point, we have discussed only the effect of the occurrence
of a response upon the probability of that response. Since probability
must be conserved, i.e., since in a time interval h an organism will
make some response or no response, we must investigate the effect of
the occurrence of one response upon the probability of another
response. In a later paper, we shall discuss this problem in detail,
but for the present purpose we must include the following assumption.
We conceive that there are two general kinds of responses, overt and
non-overt. The overt responses are subdivided into classes A, B, C,
etc. If an overt response A occurs and is neither rewarded nor
punished, then the probability of any mutually exclusive overt
response B is not changed. Nevertheless, the probability of that
response A is changed after an occurrence on which it is neither
rewarded nor punished. Since the total probability of all responses
must be unity, it follows that the probability gained or lost by
response A must be compensated by a corresponding loss or gain in
probability of the non-overt responses. This assumption is important
in the analysis of experiments which use a runway or Skinner box, for
example. In such experiments a single class of responses is singled
out for study, but other overt responses can and do occur. We defer
until a later paper the discussion of experiments in which two or more
responses are reinforced differentially.

With the aid of our mathematical operator of equation (2) we may now
describe the progressive change in the probability of a response in an
experiment such as the Graham-Gagné runway (3) or Skinner box (10) in
which the same environmental events follow each occurrence of the
response. We need only apply our operator Q repeatedly to some initial
value of the probability p. Each application of the operator
corresponds to one occurrence of the response and the subsequent
environmental events. The algebra involved in these manipulations is
straightforward. For example, if we apply Q to p twice, we have

$$Q^2 p = Q(Qp) = a + (1 - a - b)\,Qp = a + (1 - a - b)[a + (1 - a - b)p]. \tag{3}$$


Moreover, it may be readily shown that if we apply Q to p successively
n times, we have

$$Q^n p = \frac{a}{a+b} - \left(\frac{a}{a+b} - p\right)(1 - a - b)^n. \tag{4}$$

Provided a and b are not both zero or both unity, the quantity
$(1 - a - b)^n$ tends to an asymptotic value of zero as n increases.
Therefore, $Q^n p$ approaches a limiting value of a/(a + b) as n
becomes large. Equation (4) then describes a curve of acquisition.

It should be noticed that the asymp- 
totic value of the probability is not 
necessarily either zero or unity. For 
example, if a = b (speaking roughly 
this implies that the measures of re- 
ward and work are equal), the ultimate 
probability of occurrence in time h of
the response being studied is 0.5. 
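
The agreement between repeated application of Q and the closed form of equation (4) can be checked numerically. The sketch below is illustrative only: the function names and the values of a, b and the initial probability are assumptions, not figures from the paper.

```python
def apply_q(p, a, b):
    """One application of Qp = p + a(1 - p) - bp, equation (2)."""
    return p + a * (1.0 - p) - b * p


def acquisition_by_iteration(p0, a, b, n):
    """Apply Q to the initial probability p0 successively n times."""
    p = p0
    for _ in range(n):
        p = apply_q(p, a, b)
    return p


def acquisition_closed_form(p0, a, b, n):
    """Equation (4): Q^n p = a/(a+b) - (a/(a+b) - p)(1 - a - b)^n."""
    p_limit = a / (a + b)
    return p_limit - (p_limit - p0) * (1.0 - a - b) ** n


# Hypothetical values, chosen only for illustration.
a, b, p0 = 0.10, 0.05, 0.02
for n in (0, 1, 5, 20, 100):
    print(n, acquisition_by_iteration(p0, a, b, n), acquisition_closed_form(p0, a, b, n))
print("asymptote a/(a+b) =", a / (a + b))
```

Both columns should agree to rounding error, and both approach a/(a + b) rather than unity whenever b is greater than zero.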


Since we have assumed that a = 0 when no reward is given after the
response occurs, we may describe an extinction trial by a special
operator E which is equivalent to our operator Q of equation (2) with
a set equal to zero:

$$Ep = p - bp = (1 - b)p. \tag{5}$$

It follows directly that if we apply this operator E to p successively
for n times we have

$$E^n p = (1 - b)^n p. \tag{6}$$


This equation then describes a curve 
of experimental extinction. 
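
The extinction curve of equation (6) can be traced in the same way. This is a minimal sketch under assumed values; `extinction_by_iteration`, `extinction_closed_form`, and the numbers used are hypothetical.

```python
def extinction_by_iteration(p, b, n):
    """Apply the extinction operator Ep = (1 - b)p successively n times."""
    for _ in range(n):
        p = (1.0 - b) * p
    return p


def extinction_closed_form(p, b, n):
    """Equation (6): E^n p = (1 - b)^n p."""
    return (1.0 - b) ** n * p


# Hypothetical starting probability and work parameter.
p_start, b = 0.6, 0.05
for n in (0, 1, 10, 50):
    print(n, extinction_by_iteration(p_start, b, n), extinction_closed_form(p_start, b, n))
```

The probability declines geometrically toward zero and never increases, since every factor (1 - b) lies between 0 and 1.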


Probability, Latent Time, and Rate 


Before the above results on continuous reinforcement and extinction
can be compared with empirical results, we must first establish
relationships between our probability, p, and experimental measures
such as latent time and rate of responding. In order to do this, we
must have a model. A simple and useful model is the one described by
Estes (2). Let the activity of an organism be described by a sequence
of responses which are independent of one another. (For this purpose,
we consider doing “nothing” to be a response.) The probability that
the response or class of responses being studied will occur first is
p. Since we have already assumed that non-reinforced occurrences of
other responses do not affect p, one may easily calculate the mean
number of responses which will occur before the response being studied
takes place. Estes (2) has presented this calculation and shown that
the mean number of responses which will occur, including the one being
studied, is simply 1/p. In that derivation it was assumed that the
responses were all independent of one another, i.e., that transition
probabilities between pairs of responses are the same for all pairs.
This assumption is a bold one indeed (it is easy to think of overt
responses that cannot follow one another), but it appears to us that
any other assumption would require a detailed specification of the
many possible responses in each experimental arrangement being
considered. (Miller and Frick [8] have attempted such an analysis for
a particular experiment.) It is further assumed that every response
requires the same amount of time, h, for its performance. The mean
latent time, then, is simply h times the mean number of responses
which occur on a “trial”:

$$L = \frac{h}{p}. \tag{7}$$

The time, h, required for each response 
will depend, of course, on the organism
involved and very likely upon its 
strength of drive or motivation. 
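
The value 1/p for the mean number of responses follows from summing a geometric series, which can be checked numerically. The sketch below assumes that the responses are independent, as the text does; the function names and the truncation length are hypothetical choices.

```python
def mean_responses_until_target(p, terms=10_000):
    """Sum k * p * (1 - p)^(k - 1) over k, the mean count up to and
    including the first occurrence of the response being studied."""
    return sum(k * p * (1.0 - p) ** (k - 1) for k in range(1, terms + 1))


def mean_latent_time(p, h):
    """Equation (7): L = h / p."""
    return h / p


# Hypothetical probability and response duration (in seconds).
p, h = 0.25, 0.5
print("series value:", mean_responses_until_target(p), " 1/p:", 1.0 / p)
print("mean latent time:", mean_latent_time(p, h))
```

The series is truncated, so the agreement with 1/p is only approximate for very small p unless more terms are taken.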

The mean latent time, L, is expressed in terms of the probability, p,
by equation (7), while this probability is given in terms of the
number of trials, n, by equation (4). Hence we may obtain an
expression for the mean latent time as a function of the number of
trials. It turns out that this expression is identical to equation (4)
of Estes’ paper (2) except for differences in notation. (Estes uses T
in place of our n; our use of a difference equation rather than of a
differential equation gives us the term $(1 - a - b)^n$ instead of
Estes’ $e^{-qT}$.) Estes fitted his equation to the data of Graham and
Gagné (3). Our results differ from Estes’ in one respect, however: the
asymptotic mean latent time in Estes’ model is simply h, while we
obtain

$$L_\infty = \left(1 + \frac{b}{a}\right) h. \tag{8}$$

This equation s [text missing in source] two variables, [text missing
in source] conclusion seems to agree with the data of Grindley (4) on
chicks and the data of Crespi (1) on white rats.
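
Combining equations (4) and (7) gives the mean latent time on each trial, and the limit in equation (8) can be seen numerically. The following sketch uses hypothetical names and parameter values throughout.

```python
def probability_after_trials(p0, a, b, n):
    """Equation (4): probability after n reinforced trials."""
    p_limit = a / (a + b)
    return p_limit - (p_limit - p0) * (1.0 - a - b) ** n


def mean_latent_time(p0, a, b, h, n):
    """Equation (7) evaluated at the probability given by equation (4)."""
    return h / probability_after_trials(p0, a, b, n)


def asymptotic_latent_time(a, b, h):
    """Equation (8): the limiting mean latent time, (1 + b/a) h."""
    return (1.0 + b / a) * h


# Hypothetical values, chosen only for illustration.
p0, a, b, h = 0.02, 0.10, 0.05, 1.0
for n in (0, 5, 20, 100):
    print(n, mean_latent_time(p0, a, b, h, n))
print("limit:", asymptotic_latent_time(a, b, h))
```

With these assumed values the latency falls from h/p0 toward (1 + b/a)h, which exceeds h whenever b is greater than zero.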
Since equation (7) is an expression for the mean time between the end
of one response of the type being studied and the end of the next
response of the type being studied, we may now calculate the mean rate
of responding in a Skinner-box arrangement. If t represents the mean
time required for the occurrence of n responses, measured from some
arbitrary starting point, then each occurrence of the response being
studied adds an increment in t as follows:

$$\frac{\Delta t}{\Delta n} = \frac{h}{p}. \tag{9}$$

If the increments are sufficiently small, we may write them as
differentials and obtain for the mean rate of responding

$$\frac{dn}{dt} = \omega p, \tag{10}$$

where $\omega = 1/h$. We shall call $\omega$ the “activity level” and
by definition $\omega$ is the maximum rate of responding which occurs
when p = 1 obtains.


The Free-Responding Situation 


In free-responding situations, such as that in Skinner box
experiments, one usually measures rate of responding or the cumulative
number of responses versus time. To obtain theoretical expressions for
these relations, we first obtain an expression for the probability p
as a function of time. From equation (2), we see that if the response
being studied occurs, the change in probability is
$\Delta p = a(1 - p) - bp$. We have already assumed that if other
responses occur and are not reinforced, no change in the probability
of occurrence of the response being studied will ensue. Hence the
expected change in probability during a time interval h is merely the
change in probability times the probability p that the response being
studied occurs in that time interval:

$$\mathrm{Expected}(\Delta p) = p[a(1 - p) - bp]. \tag{11}$$


The expected rate of change of probability with time is then this
expression divided by the time h. Writing this rate as a derivative we
have

$$\frac{dp}{dt} = \omega p[a(1 - p) - bp], \tag{12}$$

where, as already defined, $\omega = 1/h$ is the activity level. This
equation is easily integrated to give p as an explicit function of
time t. Since equation (10) states that the mean rate of responding,
dn/dt, is $\omega$ times the probability p, we obtain after the
integration

$$\frac{dn}{dt} = \frac{\omega p_0}{p_0(1 + u)(1 - e^{-a\omega t}) + e^{-a\omega t}}, \tag{13}$$

where we have let u = b/a. The initial rate of responding at t = 0 is
$V_0 = \omega p_0$, and the final rate after a very long time t is

$$\left(\frac{dn}{dt}\right)_{t=\infty} = \frac{\omega}{1 + u} = \frac{\omega}{1 + b/a}. \tag{14}$$


Equation (13) is quite similar to the 
expression obtained by Estes except 
for our inclusion of the ratio u = b/a. 
The final rate of responding according 
to equation (14), increases with a and 
hence with the amount of reward given 
per response, and decreases with b and 
hence with the amount of work per 
response. These conclusions do not 
follow from Estes’ results (2). 
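
The closed form for the rate of responding can be compared with a direct numerical integration of equation (12). The sketch below uses a simple Euler scheme, which is an assumption of this example and not a method used in the paper; the function names and parameter values are likewise hypothetical, and the activity level w = 1/h is passed as `omega`.

```python
import math


def rate_closed_form(t, p0, a, b, omega):
    """Equation (13): mean rate of responding at time t, with u = b/a."""
    u = b / a
    decay = math.exp(-a * omega * t)
    return omega * p0 / (p0 * (1.0 + u) * (1.0 - decay) + decay)


def rate_by_euler(t_end, p0, a, b, omega, steps=20_000):
    """Integrate dp/dt = omega * p * [a(1 - p) - bp] by Euler steps,
    then convert the probability to a rate with equation (10)."""
    dt = t_end / steps
    p = p0
    for _ in range(steps):
        p += dt * omega * p * (a * (1.0 - p) - b * p)
    return omega * p


# Hypothetical values, chosen only for illustration.
p0, a, b, omega = 0.05, 0.2, 0.1, 1.0
for t in (0.0, 10.0, 50.0):
    print(t, rate_closed_form(t, p0, a, b, omega), rate_by_euler(t, p0, a, b, omega))
print("final rate from equation (14):", omega / (1.0 + b / a))
```

The two columns should agree closely, and for long times both approach the final rate of equation (14), which rises with a and falls with b.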

An expression for the cumulative number of responses during continuous
reinforcement is obtained by integrating equation (13) with respect to
time t. The result is

$$n = \frac{\omega t}{1 + u} + \frac{1}{a(1 + u)} \log\left[p_0(1 + u)(1 - e^{-a\omega t}) + e^{-a\omega t}\right]. \tag{15}$$


As the time t becomes very large, the 
exponentials in equation (15) approach 
zero and n becomes a linear function 
of time. This agrees with equation 
(14) which says that the asymptotic 
rate is a constant. Both equations 
(13) and (15) for rate of responding 
and cumulative number of responses, 
respectively, have the same form as 
the analogous equations derived by 
Estes (2) which were fitted by him to 
data on a bar-pressing habit of rats. 
The essential difference between Estes’ 
results and ours is the dependence, 
discussed above, of the final rate upon 
amount of work and amount of reward 
per trial. 
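
Equation (15) can be checked against a numerical integral of the rate in equation (13). The sketch below uses the trapezoid rule as an illustrative choice; the function names and parameter values are assumptions, and the initial probability must be greater than zero for the logarithm to be defined.

```python
import math


def response_rate(t, p0, a, b, omega):
    """Equation (13): mean rate of responding dn/dt at time t."""
    u = b / a
    decay = math.exp(-a * omega * t)
    return omega * p0 / (p0 * (1.0 + u) * (1.0 - decay) + decay)


def cumulative_responses(t, p0, a, b, omega):
    """Equation (15): cumulative number of responses n at time t."""
    u = b / a
    decay = math.exp(-a * omega * t)
    linear_part = omega * t / (1.0 + u)
    log_part = math.log(p0 * (1.0 + u) * (1.0 - decay) + decay) / (a * (1.0 + u))
    return linear_part + log_part


def cumulative_by_trapezoid(t_end, p0, a, b, omega, steps=10_000):
    """Integrate the rate of equation (13) from 0 to t_end numerically."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        left = response_rate(i * dt, p0, a, b, omega)
        right = response_rate((i + 1) * dt, p0, a, b, omega)
        total += 0.5 * (left + right) * dt
    return total


# Hypothetical values, chosen only for illustration.
p0, a, b, omega = 0.05, 0.2, 0.1, 1.0
for t in (5.0, 30.0, 100.0):
    print(t, cumulative_responses(t, p0, a, b, omega), cumulative_by_trapezoid(t, p0, a, b, omega))
```

For large t the logarithmic term settles to a constant, so the cumulative curve becomes a straight line with slope w/(1 + u).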

We may extend our analysis to give expressions for rates and
cumulative responses during extinction. Since we have assumed that
a = 0 during extinction, we have in place of equation (12)

$$\frac{dp}{dt} = -\omega b p^2, \tag{16}$$

which when integrated for p and multiplied by $\omega$ gives

$$\frac{dm}{dt} = \frac{\omega p_e}{1 + \omega p_e b t}, \tag{17}$$

where $p_e$ is the probability at the beginning of extinction. The
rate at the beginning of extinction is $V_e = \omega p_e$. Hence we
may write equation (17) in the form

$$V = \frac{dm}{dt} = \frac{V_e}{1 + V_e b t}. \tag{18}$$

An integration of this equation gives for the cumulative number of
extinction responses

$$m = \frac{1}{b} \log[1 + V_e b t] = \frac{1}{b} \log\left(\frac{V_e}{V}\right). \tag{19}$$


This result is similar to the empirical equation m = K log t, used by
Skinner in fitting experimental response curves (10). Our equation has
the additional advantage of passing through the origin as it must.
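
The two forms of equation (19) and the passage through the origin can be verified directly. This is a small sketch with hypothetical names and values; `v_e` stands for the rate at the beginning of extinction.

```python
import math


def extinction_rate(t, v_e, b):
    """Equation (18): rate of responding V at time t during extinction."""
    return v_e / (1.0 + v_e * b * t)


def cumulative_extinction_responses(t, v_e, b):
    """Equation (19): m = (1/b) log(1 + V_e b t)."""
    return math.log(1.0 + v_e * b * t) / b


# Hypothetical initial rate (responses per minute) and work parameter.
v_e, b = 10.0, 0.03
for t in (0.0, 1.0, 10.0, 100.0, 1000.0):
    m = cumulative_extinction_responses(t, v_e, b)
    m_from_rates = math.log(v_e / extinction_rate(t, v_e, b)) / b
    print(t, m, m_from_rates)
```

The count is zero at t = 0 and keeps growing with t, though ever more slowly, in line with the logarithmic form.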

It may be noted that the logarithmic character of equation (19)
implies that the total number of extinction responses, m, has no upper
limit. Thus, if our result is correct, and indeed if Skinner's
empirical equation is correct, then there is no upper limit to the
size of the “reserve” of extinction responses. For all practical
purposes, however, the logarithmic variation is so slow for large
values of the time t that it is justified to use some arbitrary
criterion for the “completion” of extinction. We shall consider
extinction to be “complete” when the mean rate of responding V has
fallen to some specified value, $V_f$. Thus, the “total” number of
extinction responses from this criterion is

$$m_T = \frac{1}{b} \log\frac{V_e}{V_f}. \tag{20}$$

We now wish to express this “total” number of extinction responses,
$m_T$, as an explicit function of the number of preceding
reinforcements, n. The only quantity in equation (20) which depends
upon n is the rate, $V_e$, at the beginning of extinction. If we
assume that this rate is equal to the rate at the end of acquisition,
we have from equations (4) and (10)

$$V_e = \omega Q^n p_0 = V_{\max} - (V_{\max} - V_0)(1 - a - b)^n, \tag{21}$$

where we have let

$$V_{\max} = \omega \frac{a}{a + b}, \tag{22}$$

and where $V_0 = \omega p_0$ is the rate at the beginning of
acquisition. If we now substitute equation (21) into equation (20), we
obtain

$$m_T = \frac{1}{b} \log\left\{\frac{V_{\max}}{V_f} - \frac{V_{\max} - V_0}{V_f}(1 - a - b)^n\right\}. \tag{23}$$

This result may be compared with the data of Williams (12) obtained by
measuring the “total” number of extinction responses after 5, 10, 30
and 90 reinforcements. From the data, the ratio $V_{\max}/V_f$ was
estimated to be about 5, and the ratio $V_0/V_f$ was assumed to be
about unity. Values of a = 0.014 and b = 0.026 were chosen in fitting
equation (23) to the data. The result is shown in the figure.
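
Equation (23) can be evaluated at the parameter values quoted above for the fit to Williams' data (a = 0.014, b = 0.026, a ratio of about 5 between the maximum and criterion rates, and a ratio of about unity between the initial and criterion rates). The function and argument names below are hypothetical, and the sketch reproduces only the theoretical curve, not the data points.

```python
import math


def total_extinction_responses(n, a, b, vmax_over_vf, v0_over_vf):
    """Equation (23), written in terms of the ratios V_max/V_f and V_0/V_f."""
    inside = vmax_over_vf - (vmax_over_vf - v0_over_vf) * (1.0 - a - b) ** n
    return math.log(inside) / b


# Parameter values quoted in the text for the fit to Williams' data.
a, b = 0.014, 0.026
vmax_over_vf, v0_over_vf = 5.0, 1.0

for n in (5, 10, 30, 90):
    print(n, total_extinction_responses(n, a, b, vmax_over_vf, v0_over_vf))
print("limit for large n:", math.log(vmax_over_vf) / b)
```

The curve starts at zero when no reinforcements have been given (because the initial and criterion rates are taken as equal) and levels off as n grows.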


Fixed Ratio and Random Ratio 
Reinforcement 


In present day psychological language,
the term “fixed ratio” (7) refers
to the procedure of rewarding every
kth response in a free-responding situation
(k = 2, 3, ...). In a “random
ratio” schedule, an animal is rewarded
on the average after k responses but the
actual number of responses per reward
varies over some specified range. We
shall now derive expressions for mean
rates of responding and cumulative
numbers of responses for these two
types of reinforcement schedules. If
we apply our operator Q, of equation
(2), to a probability p, and then apply
our operator E, of equation (5), to Qp
repeatedly for (k − 1) times, we obtain

$$ (E^{k-1}Q)p = (1 - b)^{k-1}[p + a(1 - p) - bp] = p + a'(1 - p) - b'p, \qquad (24) $$
Figure: “Total” number of extinction responses as a function of the number of reinforcements.
Curve plotted from equation (23) with b = 0.026, a = 0.014, $V_{\max} = 5V_f$, $V_0 = V_f$.
Data from Williams (12).

where

$$ a' = a(1 - b)^{k-1} = a[1 - (k - 1)b + \cdots] \cong a \qquad (25) $$

and

$$ b' = 1 - (1 - b)^k = kb\left[1 - \frac{k - 1}{2}b + \cdots\right] \cong kb. \qquad (26) $$

The symbol ≅ means “approximately
equal to.” In the present case the
exact approach would be to retain the
primes on a and b throughout; however,
the approximations provide a link
with the previous discussion. The
approximations on the right of these
two equations are justified if kb is
small compared to unity. Now the
mean change in p per response will be
the second and third terms of equation
(24) divided by k:

$$ \Delta p = \frac{a'}{k}(1 - p) - \frac{b'}{k}p. \qquad (27) $$


This equation is identical to our result 
for continuous reinforcement, except 
that a’/k replaces a and b'/k replaces b. 
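
The fixed ratio argument can be checked numerically. The sketch below applies the reward operator once and the extinction operator (k − 1) times, and compares the result with the single-operator form using the primed parameters of equations (25) and (26). The function names are hypothetical, and the form Ep = p − bp is inferred here from the factor (1 − b) in equation (24).

```python
def Q(p, a, b):
    """Reward operator of equation (2): p + a(1 - p) - b p."""
    return p + a * (1.0 - p) - b * p

def E(p, b):
    """Extinction operator, assumed to be p - b p (equation (5))."""
    return p - b * p

def fixed_ratio_cycle(p, a, b, k):
    """Apply Q once, then E (k - 1) times, as in equation (24)."""
    p = Q(p, a, b)
    for _ in range(k - 1):
        p = E(p, b)
    return p

def primed_parameters(a, b, k):
    """Equations (25) and (26): a' = a(1-b)**(k-1), b' = 1 - (1-b)**k."""
    return a * (1.0 - b) ** (k - 1), 1.0 - (1.0 - b) ** k

if __name__ == "__main__":
    p, a, b, k = 0.3, 0.014, 0.026, 4
    a_prime, b_prime = primed_parameters(a, b, k)
    print(fixed_ratio_cycle(p, a, b, k))
    print(p + a_prime * (1.0 - p) - b_prime * p)  # same value, up to rounding
    print(a_prime, a, b_prime, k * b)             # a' near a, b' near kb
```

With kb small compared to unity, the printed values of a' and b' stay close to a and kb, which is the approximation used in the text.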

We may obtain a similar result for
the “random ratio” schedule as follows:
After any response, the probability
that Q operates on p is 1/k and
the probability that E operates on
p is (1 − 1/k). Hence the expected
change in p per response is

$$ \text{Expected } (\Delta p) = \frac{1}{k}(Qp - p) + \left(1 - \frac{1}{k}\right)(Ep - p). \qquad (28) $$

After equations (2) and (5) are inserted
and the result simplified, we obtain
from equation (28)

$$ \text{Expected } (\Delta p) = \frac{a}{k}(1 - p) - bp. \qquad (29) $$
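
The simplification from equation (28) to equation (29) can be confirmed with a few lines of arithmetic. In this sketch the function names are hypothetical, and the increments Qp − p = a(1 − p) − bp and Ep − p = −bp are taken as the content of equations (2) and (5).

```python
def expected_change_random_ratio(p, a, b, k):
    """Equation (28): (1/k)(Qp - p) + (1 - 1/k)(Ep - p)."""
    q_change = a * (1.0 - p) - b * p   # Qp - p
    e_change = -b * p                  # Ep - p
    return q_change / k + (1.0 - 1.0 / k) * e_change

def simplified_change(p, a, b, k):
    """Equation (29): (a/k)(1 - p) - b p."""
    return (a / k) * (1.0 - p) - b * p

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(expected_change_random_ratio(p, 0.014, 0.026, 5),
              simplified_change(p, 0.014, 0.026, 5))
```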


This result is identical to the approxi- 
mate result shown in equation (27) for 
the fixed ratio case. Since both equations
(27) and (29) have the same
form as our result for the continuous
reinforcement case, we may at once 
write for the mean rate of responding 
an equation identical to equation (13), 
except that a is replaced by a'/k. 
Similarly, we obtain an expression for 
the final rate of responding identical 
to equation (14) except that a is re- 
placed by a’/k. This result is meant 
to apply to both fixed ratio and random
ratio schedules of reinforcement.

In comparing the above result for 
the asymptotic rates with equation 
(14) for continuous reinforcement, we 
must be careful about equating the 
activity level, w, for the three cases 
(continuous, fixed ratio and random 
ratio reinforcements). Since 1/w rep- 
resents the minimum mean time be- 
tween successive responses, it includes 
both the eating time and a “recovery 
time.” By the latter we mean the 
time necessary for the animal to reorganize
itself after eating and get in
a position to make another bar press
or key peck. In the fixed ratio case, 
presumably the animal learns to look 
for food not after each press or peck, 
as in the continuous case, but ideally 
only after every kth response. Therefore
both the mean eating time and
the mean recovery time per response 
are less for the fixed ratio case than 
for the continuous case. In the ran- 
dom ratio case, one would expect a 
similar but smaller difference to occur. 
Hence, it seems reasonable to conclude 
that the activity level, w, would be 
smaller for continuous reinforcement 
than for either fixed ratio or random 
ratio, and that w would be lower for 
random ratio than for fixed ratio when 
the mean number of responses per 
reward was the same. Moreover, we 
should expect that w would increase
with the number of responses per re- 
ward, k. Even if eating time were 
subtracted out in all cases we should 
expect these arguments to apply. 
Without a quantitative estimate of 
the mean recovery time, we see no
meaningful way of comparing rates of 
responding under continuous reinforce- 
ment with those under fixed ratio and 
random ratio, nor of comparing rates 
under different ratios (unless both 
ratios are large). The difficulty of 
comparing rates under various rein- 
forcement schedules does not seem to 
be a weakness of our model, but rather 
a natural consequence of the experi- 
mental procedure. However, the im- 
portance of these considerations hinges 
upon the orders of magnitude involved, 
and such questions are empirical ones. 


Aperiodic and Periodic Reinforcement 


Many experiments of recent years 
were designed so that an animal was 
reinforced at a rate aperiodic or periodic
in time (7). The usual procedure
is to choose a set of time intervals,
$T_1, \ldots, T_n$, which have a mean value
T. Some arrangement of this set is 
used as the actual sequence of time 
intervals between rewards. The first 
response which occurs after one of 
these time intervals has elapsed is 
rewarded. 

To analyze this situation we may
consider k, the mean number of responses
per reward, to be equal to the
mean time interval T multiplied by
the mean rate of responding:

$$ k = T\frac{dn}{dt} = Twp. \qquad (30) $$

Equation (29) for the expected change
in probability per response is still valid
if we now consider k to be a variable
as expressed by equation (30). Thus,
the time rate of change of p is

$$ \frac{dp}{dt} = \frac{a}{T}(1 - p) - wbp^2. \qquad (31) $$


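With a little effort, this differential
equation may be integrated from 0 to
t to give

$$ \frac{dn}{dt} = wp = \frac{w}{z}\,\frac{(s - 1) + (s + 1)Ke^{-sat/T}}{1 - Ke^{-sat/T}}, \qquad (32) $$

where

$$ z = 2wTb/a, \qquad (33) $$

$$ s = \sqrt{1 + 2z}, \qquad (34) $$

$$ K = \frac{1 + zp_0 - s}{1 + zp_0 + s}. \qquad (35) $$

For arbitrarily large times t, the final
rate is

$$ \left(\frac{dn}{dt}\right)_{t=\infty} = \frac{w}{z}(s - 1). \qquad (36) $$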
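
A numerical check of the final rate is sketched below. It integrates equation (31) by a simple Euler scheme and compares the long-run rate with equation (36). The parameter values (w = 1, T = 10, an initial probability of 0.1), the step size, and the function names are hypothetical choices made only for this illustration.

```python
import math

def final_rate(w, a, b, T):
    """Equation (36): (w/z)(s - 1) with z = 2wTb/a and s = sqrt(1 + 2z)."""
    z = 2.0 * w * T * b / a
    s = math.sqrt(1.0 + 2.0 * z)
    return (w / z) * (s - 1.0)

def integrate_rate(w, a, b, T, p0, t_end, dt=0.05):
    """Euler integration of equation (31):
    dp/dt = (a/T)(1 - p) - w b p**2.
    Returns the rate of responding w*p at time t_end."""
    p = p0
    for _ in range(int(t_end / dt)):
        p += dt * ((a / T) * (1.0 - p) - w * b * p * p)
    return w * p

if __name__ == "__main__":
    w, a, b, T = 1.0, 0.014, 0.026, 10.0
    print(integrate_rate(w, a, b, T, p0=0.1, t_end=2000.0))
    print(final_rate(w, a, b, T))
    # For very large T the final rate falls roughly as 1/sqrt(T).
    print(final_rate(w, a, b, 4.0e6) / final_rate(w, a, b, 1.0e6))
```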


For sufficiently large values of T,
z becomes large compared to unity
and we may write approximately

$$ \left(\frac{dn}{dt}\right)_{t=\infty} \cong w\sqrt{2/z} = \sqrt{\frac{wa}{bT}}. $$

Thus, for large values of T, the final
rate varies inversely as the square root
of T.

Periodic reinforcement is a special
case of aperiodic reinforcement
in which the set of time intervals,
$T_1, \ldots, T_n$, discussed above, consists
of a single time interval, T. Thus,
all the above equations apply to both
periodic and aperiodic schedules. One
essential difference is known, however.
In the periodic case the animal can
learn a time discrimination, or as is
sometimes said, eating becomes a cue
for not responding for a while. This
seems to be an example of stimulus
discrimination which we will discuss
in a later paper.


Extinction After Partial Reinforcement
Schedules


In the discussion of extinction in 
earlier sections, it may be noted that 
the equations for mean rates and 
cumulative responses depended on the 
previous reward training only through
$V_e$, the mean rate at the beginning of
extinction. Hence, we conclude that 
equations (18) and (19) apply to 
extinction after any type of reinforcement
schedule. However, the quantities
$V_e$ and b in our equations may
depend very much on the previous 
training. Indeed, if our model makes 
any sense at all, this must be the case, 
for “resistance” to extinction is known 
to be much greater after partial rein- 
forcement training than after a con- 
tinuous reinforcement schedule (7). 
Since the rate at the start of extinction,
$V_e$, is nearly equal to the rate at
the end of acquisition, it will certainly 
depend on the type and amount of 
previous training. However, the log- 
arithmic variation in equations (19) 
and (20) is so slow, it seems clear that 
empirical results demand a dependence 
of b on the type of reinforcement 
schedule which preceded extinction. 
We have argued that b increases with 
the amount of work required per re- 
sponse. We will now try to indicate 
how the required work might depend 
upon the type of reinforcement sched- 
ule, even though the lever pressure or 
key tension is the same. For con- 
tinuous reinforcement, the response 
pattern which is learned by a pigeon, 
for example, involves pecking the key 
once, lowering its head to the food 
magazine, eating, raising its head, and 
readjusting its body in preparation for 
the next peck. This response pattern 
demands a certain amount of effort. 
On the other hand, the response pat- 
tern which is learned for other types 
of reinforcement schedules is quite 
different; the bird makes several key 
pecks before executing the rest of 
the pattern just described. Thus we 
would expect that the average work 
required per key peck is considerably 
less than for continuous reinforcement. 
This would imply that b is larger and 
thus “resistance” to extinction is less
for continuous reinforcement than for
all other schedules. This deduction 
is consistent with experimental results 
(7). However, this is just part of the 
story. For one thing, it seems clear 
that it is easier for the organism to 
discriminate between continuous rein- 
forcement and extinction; we have not 
handled this effect here. 


Summary 


A mathematical model for simple 
learning is presented. Changes in the 
probability of occurrence of a response 
in a small time h are described with 
the aid of mathematical operators. 
The parameters which appear in the 
operator equations are related to exper- 
imental variables such as the amount 
of reward and work. Relations be- 
tween the probability and empirical 
measures of rate of responding and 
latent time are defined. Acquisition 
and extinction of behavior habits are 
discussed for the simple runway and 
for the Skinner box. Equations of 
mean latent time as a function of trial 
number are derived for the runway 
problem; equations for the mean rate 
of responding and cumulative numbers 
of responses versus time are derived 
for the Skinner box experiments. An 
attempt is made to analyze the learn- 
ing process with various schedules of 
partial reinforcement in the Skinner 
type experiment. Wherever possible, 
the correspondence between the present
model and the work of Estes (2)
is pointed out. 
A MODEL FOR STIMULUS GENERALIZATION
AND DISCRIMINATION


BY ROBERT R. BUSH AND FREDERICK MOSTELLER
Harvard University


INTRODUCTION 


The processes of stimulus generali- 
zation and discrimination seem as fun- 
damental to behavior theory as the 
simple mechanisms of reinforcement 
and extinction are to learning theory. 
Whether or not this distinction be- 
tween learning and behavior is a useful 
one, there can be little doubt that few 
if any applications of behavior theory 
to practical problems can be made 
without a clear exposition of the phe- 
nomena of generalization and discrimi- 
nation. It is our impression that few 
crucial experiments in this area have 
been reported compared with the num- 
ber of important experiments on simple 
conditioning and extinction. Perhaps 
part of the reason for this is that there
are too few theoretical formulations
available. That is to say, we conceive
that explicit and quantitative
theoretical structures are useful in
guiding the direction of experimental
research and in suggesting the type of
data which are needed.

In this paper we describe a model
based upon elementary concepts of
mathematical set theory. This model
provides one possible framework for
analyzing problems in stimulus generalization
and discrimination. Further,
we shall show how this model
generates the basic postulates of our 
previous work on acquisition and ex- 
tinction (1), where the stimulus situa- 
tion as defined by the experimenter 
was assumed constant. 

Stated in the simplest terms, gener- 
alization is the phenomenon in which 
an increase in strength of a response 
learned in one stimulus situation im- 
plies an increase in strength of response 
in a somewhat different stimulus sit- 
uation. When this occurs, the two 
situations are said to be similar. Although
there are several intuitive
notions as to what is meant by “similarity,”
one usually means the properties
which give rise to generalization.
We see no alternative to using the
amount of generalization as an operational
definition of degree of “similarity.”
In the model, however, we
shall give another definition of the 
degree of similarity, but this definition 
will be entirely consistent with the 
above-mentioned operational defini- 
tion. 

We also wish to clarify what we 
mean by stimulus discrimination. In 
one sense of the term, all learning is a 
process of discrimination. Our usage 
of the term is a more restricted one, 
however. We refer specifically to the 
process by which an animal learns to 
make response A in one stimulus situation
and response B (or response A
with different “strength”) in a different
stimulus situation. We are not
at the moment concerned with, for 
example, the process by which an 
animal learns to discriminate between 
various possible responses in a fixed
stimulus situation. 

As prototypes of the more general 
problems of stimulus generalization 
and discrimination, we shall consider 
the following two kinds of experiments: 


(i) An animal is trained to make a 
particular response, by the usual rein- 
forcement procedure, in an experimen- 
tally defined stimulus situation. At the 
end of training, the response has a certain 
strength or probability of occurrence.
The animal is then “tested” in a new 
stimulus situation similar to the training 
one and in which the same response, 
insofar as it is experimentally defined, 
is possible. One then asks about the 
strength or probability of occurrence of 
the response in this new stimulus situa- 
tion and how it depends on the degree of 
similarity of the new situation to the old 
stimulus situation. 

(ii) An animal is presented alternately
with two stimulus situations which are 
similar. In one, an experimentally de- 
fined response is rewarded, and in the 
other that response is either not re- 
warded or rewarded less than in the first. 
Through the process of generalization,
the effects of rewards and non-rewards 
in one stimulus situation influence the 
response strength in the other, but even- 
tually the animal learns to respond in 
one but not in the other, or at least to 
respond with different probabilities (rates
or strengths). One then asks how the
probability of the response in each situation
varies with the number of training
trials, with the degree of similarity of the
two situations, and with the amount of
reward.


We do not consider that these two 
kinds of experiments come close to exhausting
the problems classified under
the heading of generalization and discrimination,
but we do believe that
they are fundamental. Thus, the 
model to be described has been de- 
signed to permit analysis of these 
experiments. In the next section we 
will present the major features of the 
model, and in later sections we shall 
apply it to the above described experiments.


THE MODEL


We shall employ some of the ele- 
mentary notions of mathematical set 
theory to define our model. A par- 
ticular stimulus situation, such as an 
experimental box with specific prop- 
erties (geometrical, optical, acoustical, 
etc.) is regarded as separate and dis- 
tinct from the rest of the universe. 
Thus, we shall denote this situation 
by a set of stimuli which is part of the 
entire universe of stimuli. The elements
of this set are undefined and we
place no restriction on their number.
This lack of definition of the stimulus
elements does not give rise to any
serious difficulties since our final results
involve neither properties of individual
elements nor numbers of such elements.
We next introduce the notion
of the measure of a set. If the set consists
of a finite number of elements, we
may associate with each element a positive
number to denote its “weight”;
the measure of such a set is the sum of
all these numbers. Intuitively, the
weight associated with an element is
the measure of the potential importance
of that element in influencing
the organism’s behavior. More generally,
we can define a density function
over the set; the measure is the integral
of that function over the set.

To bridge the gap between stimuli 
and responses, we shall borrow some 
of the basic notions of Estes (2). 
(The concept of reinforcement will
play an integral role, however.) It is
assumed that stimulus elements exist
in one of two states as far as the
organism involved is concerned; since
the elements are undefined, these states
do not require definition but merely
need labelling. However, we shall
speak of elements which are in one
state as being “conditioned” to the
response, and of elements in the other
state as being “non-conditioned.”

On a particular trial or occurrence 
of a response in the learning process, 
it is conceived that an organism per- 
ceives a sub-set of the total stimuli 
available. It is postulated that the 
probability of occurrence of the re- 
sponse in a given time interval is equal 
to the measure of the elements in the 
sub-set which had been previously con- 
ditioned, divided by the measure of 
the entire sub-set. Speaking roughly, 
the probability is the ratio of the im- 
portance of the conditioned elements 
perceived to the importance of all the 
elements perceived. It is further as- 
sumed that the sub-set perceived is 
conditioned to the response if that 
response is rewarded.

The situation is illustrated in Fig. 1. 
It would be wrong to suppose that the 
conditioned and non-conditioned ele- 
ments are spatially separated in the 
actual situation as Fig. 1 might sug- 
gest; the conditioned elements are 
spread out smoothly among the non- 
conditioned ones. In set-theoretic 
notation, we then have for the proba- 
bility of occurrence of the response 


$$ p = \frac{m(X \cap C)}{m(X)}, \qquad (1) $$

where m( ) denotes the measure of
any set or sub-set named between the
parentheses, and where X ∩ C indicates
the intersection of X and C (also
called set-product, meet, or overlap of
X and C). We then make an assumption
of equal proportions in the measures
so that

$$ p = \frac{m(X \cap C)}{m(X)} = \frac{m(C)}{m(S)}. \qquad (2) $$

Fig. 1. Set diagram of the single stimulus
situation S with the various sub-sets involved
in a particular trial. C is the sub-set of
elements previously conditioned, X the sub-set
of S perceived on the trial. The sub-sets
A and B are defined in the text.

Heuristically, this assumption of
equal proportions can arise from a
fluid model. Suppose that the total
situation is represented by a vessel
containing an ideal fluid which is a
mixture of two substances which do
not chemically interact but are completely
miscible. For discussion let
the substances be water and alcohol
and assume, contrary to fact, that the
volume of the mixture is equal to the
sum of the partial volumes. The volume
of the water corresponds to the
measure of the sub-set of non-condi- 
tioned stimuli, S — C (total set minus 
the conditioned set), and the volume 
of the alcohol corresponds to the 
measure of the sub-set C of condi- 
tioned stimuli. The sub-set X corre- 
sponds to a thimbleful of the mixture 
and of course if the fluids are well 
mixed, the volumetric fraction of alco- 
hol in a thimbleful will be much the 
same as that in the whole vessel. 
Thus the fraction of measure of con- 
ditioned stimuli in X will be equal to 
the fraction in the whole set S, as 
expressed by equation (2). Our defi- 
nition of p is essentially that of Estes 
(2) except that where he speaks of 
number of elements, we speak of the 
measure of the elements. 
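
The thimbleful argument can be illustrated by a small simulation. In the sketch below a finite set of elements is given hypothetical weights, a well-mixed fraction of the elements is marked as conditioned, and the conditioned fraction of measure in a randomly drawn sub-set X is compared with that in the whole set S. The element counts, the weight range, and the function names are assumptions made only for this example.

```python
import random

def conditioned_fraction(weights, conditioned, indices):
    """Measure of the conditioned elements among `indices` divided by the
    measure of all elements among `indices` (equation (1) when the indices
    form the perceived sub-set X)."""
    total = sum(weights[i] for i in indices)
    cond = sum(weights[i] for i in indices if conditioned[i])
    return cond / total

def thimbleful_demo(n_elements=20000, sample_size=5000, seed=0):
    """Return (fraction in a random sub-set X, fraction in the whole set S)."""
    rng = random.Random(seed)
    weights = [rng.uniform(0.5, 1.5) for _ in range(n_elements)]   # hypothetical weights
    conditioned = [rng.random() < 0.3 for _ in range(n_elements)]  # well-mixed conditioning
    x = rng.sample(range(n_elements), sample_size)                 # the "thimbleful" X
    return (conditioned_fraction(weights, conditioned, x),
            conditioned_fraction(weights, conditioned, range(n_elements)))

if __name__ == "__main__":
    in_x, in_s = thimbleful_demo()
    print(in_x, in_s)  # the two fractions should be close, as in equation (2)
```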

We next consider another stimulus 
situation which we denote by a set S’. 
In general this new set S’ will not be 
disjunct from the set S, i.e., S and S’
will intersect or overlap as shown in
Fig. 2. We denote the intersection by 


$$ I = S \cap S'. \qquad (3) $$


We can now define an index of
similarity of S’ to S by

$$ \eta(S' \text{ to } S) = \frac{m(I)}{m(S')}. \qquad (4) $$

In words this definition says that the
index of similarity of S’ to S is the
measure of their intersection divided
by the measure of the set S’. (Our
notation makes clear that we have
made a tacit assumption that the
measure of an element or set of elements
is independent of the set in
which it is measured.) Definition (4)
also gives the index of similarity of
S to S’ as

$$ \eta(S \text{ to } S') = \frac{m(I)}{m(S)} = \frac{m(S')}{m(S)}\,\eta(S' \text{ to } S). \qquad (5) $$


From this last equation it is clear that
the similarity of S’ to S may not be
the same as the similarity of S to S’.
In fact, if the measure of the intersection
is not zero, the two indices are
equal only if the measures of S and S’
are equal. It seems regrettable that
similarity, by our definition, is non-symmetric.
However, we do not care
to make the general assumption that
(a) the measures of all situations are
equal and at the same time make the
assumption that (b) the measure of an
element or set of elements is the same
in each situation in which it appears.
For then the importance of a set of
elements, say a light bulb, would have
to be the same in a small situation,
say a box, as in a large
situation, say a ballroom. Further,
this pair of assumptions, (a) and (b),
leads to conceptual difficulties.
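
The asymmetry of the similarity index is easy to see with a small finite example. The sketch below uses Python sets and a table of weights; the element names, the weights, and the function names are hypothetical and serve only to illustrate definitions (4) and (5).

```python
def measure(elements, weight):
    """Measure of a finite set: the sum of the weights of its elements."""
    return sum(weight[e] for e in elements)

def similarity(first, second, weight):
    """Index of similarity of `first` to `second`, following equation (4):
    the measure of their intersection divided by the measure of `first`."""
    return measure(first & second, weight) / measure(first, weight)

# Hypothetical stimulus elements and weights, chosen only for illustration.
weight = {"light": 2.0, "tone": 1.0, "floor": 1.0, "walls": 4.0, "odor": 2.0}
S = {"light", "tone", "floor"}
S_prime = {"light", "walls", "odor"}

eta_new_to_old = similarity(S_prime, S, weight)   # m(I)/m(S')
eta_old_to_new = similarity(S, S_prime, weight)   # m(I)/m(S)

if __name__ == "__main__":
    print(eta_new_to_old, eta_old_to_new)
```

Here the two indices differ because the measures of the two sets differ, and they are related by the ratio of those measures as in equation (5).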
THE GENERALIZATION PROBLEM


We are now in a position to say 
something about the first experimental 
problem described in the Introduction. 
An animal is trained to make a re- 
sponse in one stimulus situation and 
then his response strength is measured 
in a similar situation. After the ani- 
mal has been trained in the first situa- 
tion whose elements form the set S, a 
sub-set C of S will have been condi- 
tioned to the response as shown in 
Fig. 2. But part of the sub-set C is
also contained in the second situation
whose elements form the set S’; we
denote this part by C ∩ S’.

From the discussion preceding equations
(1) and (2), we can easily see
that the probability of the response
occurring in S’ is


$$ p' = \frac{m(C \cap S')}{m(S')}. \qquad (6) $$

We now use the assumption of equal
proportions so that

$$ \frac{m(C \cap S')}{m(I)} = \frac{m(C \cap I)}{m(I)} = \frac{m(C)}{m(S)}. \qquad (7) $$

The first equality in this equation
follows from the fact that the only
part of C which is in S’ is in the intersection
I as shown in Fig. 2. The
second equality in equation (7) is an
application of our assumption that the
measure of C is uniformly distributed
over S and so the intersection contains
the same fraction of measure of C as
does the entire set S.

Fig. 2. Diagram of two similar stimulus
situations after conditioning in one of them.
The situation in which training occurred is
denoted by the set S; the sub-set C of S
represents the portion of S which was conditioned
to the response. The new stimulus
situation in which the response strength is to
be measured is represented by the set S’,
and the intersection of S’ and S is denoted by I.

If now we combine equations (6)
and (7), we obtain

$$ p' = \frac{m(I)}{m(S')} \cdot \frac{m(C)}{m(S)}. \qquad (8) $$


From equation (4) we note that the 
first ratio in equation (8) is the index 
of similarity of S’ to S, while from 
equation (2) we observe that the second
ratio in equation (8) is merely
the probability p of the response in S.
Hence

$$ p' = \eta(S' \text{ to } S)\,p. \qquad (9) $$

Equation (9) now provides us with
the necessary operational definition of
the index of similarity, η(S’ to S), of
the set S’ to the set S. The probabilities
p and p’ of the response in S
and S’, respectively, can be measured
ments of latent time or rate of re- 
sponding (1). Therefore, with equa- 
tion (9), we have an operational way 
of determining the index of similarity. 
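
In practice equation (9) can be read in either direction. The sketch below predicts the response probability in the new situation from a known index, and, conversely, estimates the index from two measured probabilities. The function names and the numerical values are hypothetical.

```python
def generalized_probability(p, eta):
    """Equation (9): probability in S' equals eta(S' to S) times p."""
    return eta * p

def estimated_similarity(p, p_prime):
    """Solve equation (9) for the index: eta(S' to S) = p'/p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return p_prime / p

if __name__ == "__main__":
    # Hypothetical measurements: p = 0.8 in training, p' = 0.2 in the test.
    print(estimated_similarity(0.8, 0.2))
    print(generalized_probability(0.8, 0.25))
```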

As a direct consequence of our as- 
sumption of equal proportions, we can 
draw the following general conclusion. 
Any change made in a stimulus situa- 
tion where a response was conditioned 
will reduce the probability of occurrence 
of that response, provided the change 
does not introduce stimuli which had 
been previously conditioned to that re- 
sponse. This conclusion follows from 
equation (9) and the fact that we have 
defined our similarity index in such a 
way that it is never greater than unity. 

A word needs to be said about the 
correspondence between our result and 
the experimental results such as those 
of Hovland (3). Our model predicts 
nothing about the relation of the index 
of similarity defined above to such
physical dimensions as light or sound 
intensity, frequency, etc. In fact, our 
model suggests that no such general 
relation is possible, i.e., that any sensible
measure of similarity is very
much organism determined. There- 
fore, from the point of view of our 
model, experiments such as those of 
Hovland serve only as a clear demon- 
stration that stimulus generalization 
exists. In addition, of course, such 
experiments provide empirical rela- 
tions, characteristic of the organism 
studied, between the proposed index of 
similarity and various physical dimen- 
sions, but these relations are outside 
the scope of our model. 

We conclude, therefore, that our 
model up to this point has made 
no quantitative predictions about the 
shape of generalization gradients which 
can be compared with experiment. 
Nevertheless, the preceding analysis 
of generalization does provide us with 
a framework to discuss experiments on 
stimulus discrimination. In the fol- 
lowing sections we shall extend our 
model so as to permit analysis of such 
experiments. 


THE REINFORCEMENT AND EXTINCTION OPERATORS


In this section we develop some re- 
sults that will be used later and show 
that the model of the present paper 
generates postulates used in our pre- 
vious paper (1). We shall examine 
the step-wise change in probability of 
a response in a single stimulus situation
S. We generalize the notions
already presented as follows: Previous 
to a particular trial or occurrence of 
the response, a sub-set C of S will 
have been conditioned. On the trial 
in question a sub-set X of S will be 
perceived as shown in Fig. 1. According
to our previous assumptions,
the probability of the response is

$$ p = \frac{m(X \cap C)}{m(X)} = \frac{m(C)}{m(S)}. \qquad (10) $$

We now assume that a sub-set A of X
will be conditioned to the response as
a result of the reward given and that
the measure of A will depend on the
amount of reward, on the strength of
motivation, etc. We further assume
that another sub-set B of X will become
non-conditioned as a result of
the work required in making the response.
For simplicity we assume
that A and B are disjunct. (The error
resulting from this last assumption
can be shown to be small if the measures
of A and B are small compared
to that of S.)

We extend our assumption of equal
proportions so that we have

$$ \frac{m(A \cap C)}{m(A)} = \frac{m(B \cap C)}{m(B)} = \frac{m(C)}{m(S)}. \qquad (11) $$

Now at the end of the trial being considered,
sub-set A is part of the new
conditioned sub-set while sub-set B is
part of the new non-conditioned sub-set.
Thus, the change in the measure
of C is

$$ \Delta m(C) = [m(A) - m(A \cap C)] - m(B \cap C) = m(A)(1 - p) - m(B)p. \qquad (12) $$

This last form of writing equation (12)
results from the equalities given in
equations (10) and (11). If we then
let

$$ a = \frac{m(A)}{m(S)}, \qquad b = \frac{m(B)}{m(S)}, \qquad (13) $$

and divide equation (12) through by
m(S), we have finally for the change
in probability:

$$ \Delta p = \frac{\Delta m(C)}{m(S)} = a(1 - p) - bp. \qquad (14) $$
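
The step from equation (12) to equation (14) can be traced with exact arithmetic. In the sketch below the measures of S, C, A, and B are hypothetical numbers, the equal-proportions assumption of equation (11) is imposed directly, and the function names are introduced only for this illustration.

```python
from fractions import Fraction

def change_from_sets(m_S, m_C, m_A, m_B):
    """Change in probability computed from the measures of the sub-sets."""
    p = Fraction(m_C, m_S)                       # equation (10)
    m_A_and_C = m_A * p                          # equal proportions, equation (11)
    m_B_and_C = m_B * p
    delta_m_C = (m_A - m_A_and_C) - m_B_and_C    # equation (12)
    return delta_m_C / m_S

def change_from_operator(m_S, m_C, m_A, m_B):
    """The same change written as a(1 - p) - b p, equations (13) and (14)."""
    p = Fraction(m_C, m_S)
    a = Fraction(m_A, m_S)
    b = Fraction(m_B, m_S)
    return a * (1 - p) - b * p

if __name__ == "__main__":
    # Hypothetical measures: m(S) = 100, m(C) = 40, m(A) = 5, m(B) = 10.
    print(change_from_sets(100, 40, 5, 10))
    print(change_from_operator(100, 40, 5, 10))
```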


We thus define a mathematical operator
Q which when applied to p gives
a new value of probability Qp effective
at the start of the next trial:

$$ Qp = p + a(1 - p) - bp. \qquad (15) $$


This operator is identical to the gen- 
eral operator postulated in our model 
for acquisition and extinction in a fixed 
stimulus situation (1). Hence, the 
set-theoretic model we have presented 
generates the basic postulates of our 
previous model which we applied to 
other types of learning problems (1). 
When the operator Q is applied n times
to an initial probability $p_0$, we obtain

$$ Q^n p_0 = p_n = p_\infty - (p_\infty - p_0)g^n, \qquad (16) $$

where $p_\infty = a/(a + b)$ and g = 1 − a − b.
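
Equation (16) can be confirmed by applying the operator of equation (15) repeatedly and comparing with the closed form. The function names and the parameter values in this sketch are hypothetical.

```python
def Q(p, a, b):
    """Equation (15): Qp = p + a(1 - p) - b p."""
    return p + a * (1.0 - p) - b * p

def iterate_Q(p0, a, b, n):
    """Apply Q to p0 a total of n times."""
    p = p0
    for _ in range(n):
        p = Q(p, a, b)
    return p

def closed_form(p0, a, b, n):
    """Equation (16): p_n = p_inf - (p_inf - p0) * g**n."""
    p_inf = a / (a + b)
    g = 1.0 - a - b
    return p_inf - (p_inf - p0) * g ** n

if __name__ == "__main__":
    print(iterate_Q(0.1, 0.014, 0.026, 50))
    print(closed_form(0.1, 0.014, 0.026, 50))
```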

In the next section we shall apply 
these results to the experiment on 
stimulus discrimination described in 
the Introduction. 


THE DISCRIMINATION PROBLEM 


Fig. 3. Set diagram for discrimination
training in two similar stimulus situations, S
and S’. The various disjunct sub-sets are
numbered. Set S includes 1, 3, 5, and 6;
S’ includes 2, 4, 5, and 6. The intersection I
is denoted by 5 and 6. T, the complement
of I in S, is shown by 1 and 3; T’, the complement
of I in S’, is shown by 2 and 4. C,
the conditioned sub-set in S, is represented by
3 and 6, while the conditioned sub-set in S’
is represented by 4 and 6. $T_c$ is denoted by 3,
$T_c'$ by 4, and $I_c$ by 6.

We are now in a position to treat
the second experimental problem described
in the Introduction. An animal
is presented alternately with two
stimulus situations S and S’ which are
similar, i.e., which have a non-zero
intersection. The rewards which follow
occurrences of the response are
different for the two situations, and
we are interested in how the response
strengths vary with training. At any
point in the process, sub-sets of S
and S’ will be conditioned to the response
as shown in Fig. 3. We shall
distinguish between that part of S
which is also in S’ and that part
which is not by letting I = S ∩ S’ and
T = S − (S ∩ S’) = S − I. We also
distinguish between the part of the
conditioned sub-set C of S which is in
I and that which is in T, by letting
$I_c$ = C ∩ I and $T_c$ = C − (C ∩ I) = T ∩ C.
The probability of the response in S is


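$$ p = \frac{m(C)}{m(S)} = \frac{m(T_c) + m(I_c)}{m(S)}. \qquad (17) $$

Then we let

$$ \alpha = \frac{m(T_c)}{m(T)}, \qquad (18) $$

$$ \beta = \frac{m(I_c)}{m(I)}, \qquad (19) $$

and, abbreviating η(S to S’) with η, we
may write (17) in the form

$$ p = \alpha(1 - \eta) + \beta\eta. \qquad (20) $$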
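
The decomposition in equation (20) can be checked on a small example. The sketch below assigns hypothetical measures to the six disjunct sub-sets numbered in Fig. 3 and compares the probability of equation (17) with the form of equation (20), using exact fractions. The measures and the variable names are assumptions made only for this illustration.

```python
from fractions import Fraction

# Hypothetical measures for the six disjunct sub-sets numbered in Fig. 3.
m = {1: Fraction(4), 2: Fraction(4), 3: Fraction(2),
     4: Fraction(1), 5: Fraction(3), 6: Fraction(1)}

def total(parts):
    """Measure of the union of the listed disjunct sub-sets."""
    return sum(m[i] for i in parts)

m_S, m_I, m_T = total([1, 3, 5, 6]), total([5, 6]), total([1, 3])
m_Tc, m_Ic = total([3]), total([6])

p = (m_Tc + m_Ic) / m_S          # equation (17)
alpha = m_Tc / m_T               # equation (18)
beta = m_Ic / m_I                # equation (19)
eta = m_I / m_S                  # index of similarity of S to S'

if __name__ == "__main__":
    print(p, alpha * (1 - eta) + beta * eta)  # the two values agree
```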


We write the probability of the response
in this form because we shall
soon argue that the index η varies
during discrimination training. First,
however, we shall investigate the variation
of α and β with the number of
training trials. From the definitions
of α and β, equations (18) and (19),
we see that these variables are very
much like our probability p of equation (17)
except that they refer to
sub-sets of S rather than to the entire
set. By strict analogy with the arguments
in the last section, we conclude
that

$$ \alpha_n = Q^n\alpha_0 = \alpha_\infty - (\alpha_\infty - \alpha_0)g^n, \qquad (21) $$

where $\alpha_\infty = a/(a + b)$ and g = 1 − a − b.
Now β, the fraction of conditioned
stimuli in the intersection I, changes
with each presentation of S’ as well as
of S. Thus, for each presentation of
S, we must operate on β twice, once
by our operator Q which describes the
effect of the environmental events in
S, and once by an analogous operator
Q’ which describes the effect of the
events in S’. Hence, it may be shown
that

$$ \beta_n = (Q'Q)^n\beta_0 = \beta_\infty - (\beta_\infty - \beta_0)f^n, \qquad (22) $$

where

$$ \beta_\infty = \frac{a' + a(1 - a' - b')}{[a' + a(1 - a' - b')] + [b' + b(1 - a' - b')]} \qquad (23) $$

and where f = (1 − a − b)(1 − a’ − b’).
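
The closed form in equations (22) and (23) can be compared with direct iteration of the two operators. In the sketch below the parameters of Q’ are written a2 and b2 in place of a’ and b’; these names, the function names, and the numerical values are hypothetical.

```python
def Q(x, a, b):
    """Operator of equation (15) applied to a fraction x."""
    return x + a * (1.0 - x) - b * x

def beta_by_iteration(beta0, a, b, a2, b2, n):
    """Apply Q (events in S) and then Q' (events in S') n times."""
    beta = beta0
    for _ in range(n):
        beta = Q(Q(beta, a, b), a2, b2)
    return beta

def beta_closed_form(beta0, a, b, a2, b2, n):
    """Equations (22) and (23), with a2, b2 standing for a', b'."""
    g2 = 1.0 - a2 - b2
    numerator = a2 + a * g2
    beta_inf = numerator / (numerator + b2 + b * g2)
    f = (1.0 - a - b) * g2
    return beta_inf - (beta_inf - beta0) * f ** n

if __name__ == "__main__":
    # Hypothetical case: reward in S (a = 0.1, b = 0.02), no reward in S'.
    print(beta_by_iteration(0.2, 0.1, 0.02, 0.0, 0.05, 25))
    print(beta_closed_form(0.2, 0.1, 0.02, 0.0, 0.05, 25))
```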

It should be stressed that we are 
assuming that the response occurs and 
is rewarded to the same degree on 
every presentation of S. The same 
statement, mutatis mutandis, applies 
to S’. Without this assumption, we
are not justified in applying the oper- 
ators Q and Q' for each presentation. 
The probability is then the probability 
that the response will occur in an 
interval of time, h. The operational 
measure of this probability is the mean 
latent time, which according to the 
response model discussed earlier varies 
inversely as the probability (DD). 

We now have cleared the way for 
discussing the central feature of our 
model for discrimination problems. 
We conceive that the measure of the 
intersection I of the two sets S and S’ 
decreases as discrimination learning 
progresses. This concept seems to make sense intuitively since the
measure of any sub-set of stimuli indicates the importance of that
sub-set in influencing behavior. If an animal is rewarded for a response
in S but not rewarded for it in S', then the stimuli in I are unreliable
for deciding whether or not to make the response. And it is just this
ambiguity which causes the measure of the intersection to decrease with
training. We shall describe this change by introducing a
“discrimination operator,” denoted by D, which operates on the
similarity index $\eta$ each time the environmental event following the
response changes from one type of event to another, e.g., from reward to
non-reward. In the present problem, we are considering alternate
presentations of S and S' and thus alternate occurrences of the events
associated with the operators Q and Q'. So if $\eta_i$ is the ratio of the
measure of I to that of S after the ith presentation of S, the ratio
after the (i + 1)th presentation is

$$\eta_{i+1} = D\eta_i. \tag{24}$$


Our next task is to postulate the form 
of the operator D. 

We find that neither experimental data nor our intuition is of much help
in guiding our choice of such a postulate. For mathematical simplicity we
choose an operator which represents a linear transformation on $\eta$.
Moreover, we wish to have an operator which always decreases $\eta$ (or
holds it fixed), but which will never lead to negative values of $\eta$.
Therefore, we postulate that

$$D\eta = k\eta, \tag{25}$$

where k is a new parameter which is in the range between zero and 1. We
then have

$$\eta_n = D^n \eta_0 = k^n \eta_0. \tag{26}$$

Combining equations (20), (21), (22), and (26), we have

$$\begin{aligned} p_n &= Q^n \alpha_0 (1 - D^n \eta_0) + (Q'Q)^n \beta_0 D^n \eta_0 \\ &= [\alpha_\infty - (\alpha_\infty - \alpha_0) g^n](1 - k^n \eta_0) + [\beta_\infty - (\beta_\infty - \beta_0) f^n] k^n \eta_0. \end{aligned} \tag{27}$$


This is our final expression for the 
variation of p,, the probability of the 
response in situation S, as a function 
of the trial number n. This equation 
is composed of two major terms. The first term corresponds to the
relative measure of the stimulus elements of S which are not in S' (the
measure of $T_c$ divided by the measure of S). The second term
corresponds to the relative measure of the elements in the intersection
of S and S' (the measure of $I_c$ divided by the measure of S).

Because of the symmetry between S and S', we may write for the
probability in S':

$$p_n' = [\alpha_\infty' - (\alpha_\infty' - \alpha_0') g'^n](1 - k^n \eta_0') + [\beta_\infty - (\beta_\infty - \beta_0) f^n] k^n \eta_0', \tag{28}$$

where $\alpha_\infty' = a'/(a' + b')$, and $g' = 1 - a' - b'$, and where
$\eta_0'$ is the initial value of

$$\eta' = \eta(S' \text{ to } S) = \frac{m(I)}{m(S')} = \eta\,\frac{m(S)}{m(S')}. \tag{29}$$

We shall now consider some special 
examples for which certain simplifying 
assumptions can be made.

(a) No conditioning before discrimination training. If no previous
conditioning took place in either S or S', it seems reasonable to assume
that the “operant” levels of performance in the two situations are the
same. Moreover, in view of our assumptions of equal proportions, we may
assume that initially:

$$\frac{m(C)}{m(S)} = \frac{m(T_c)}{m(T)} = \frac{m(I_c)}{m(I)} = \frac{m(T_c')}{m(T')} = \frac{m(C')}{m(S')}. \tag{30}$$


Hence, from equations (17), (18), and (19), we have
$p_0 = \alpha_0 = \alpha_0' = \beta_0$. Moreover, inspection of equation
(27) shows that, except when k = 1, we have $p_\infty = \alpha_\infty$, and
in like manner from equation (28) for $k \neq 1$, we have
$p_\infty' = \alpha_\infty'$. In Fig. 4 we have plotted equations (27) and
(28) with the above assumptions. The values a = 0.12, b = 0.03,
$p_0 = 0.05$, $\eta_0 = \eta_0' = 0.50$, k = 0.95 were chosen for these
calculations. As can be seen, the probability of the response in S is a
monotonically increasing, negatively accelerated function of the trial
number, while the probability in S' first increases due to
generalization, but then decreases to zero as the discrimination is
learned. These curves describe the general sort of result obtained by
Woodbury for auditory discrimination in dogs (4).

FIG. 4. Curves of probability, p (in S), and p' (in S'), versus trial
number, n, for discrimination training without previous conditioning. It
was assumed that the response was rewarded in S but not rewarded in S'.
Equation (27), equation (28), and the values $p_0 = p_0' = 0.05$,
a = 0.12, a' = 0, b = b' = 0.03, $\eta_0 = \eta_0' = 0.50$, and k = 0.95
were used.

FIG. 5. Reciprocals of probability, p, of the response in S, and p', of
the response in S', versus trial number, for discrimination training
without previous conditioning. In the model described earlier (1), mean
latent time is proportional to the reciprocal of probability. The curves
were plotted from the values of probability shown in Fig. 4.
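The curves described for case (a) can be recomputed from equations (27) and (28). The sketch below is only an illustration under stated assumptions: it uses the closed forms (21) to (23), takes the values listed in the caption of Fig. 4, and sets all initial fractions equal to $p_0$ as equation (30) implies; the function and variable names are ours.

```python
def closed_form(x0, x_inf, rate, n):
    """x_n = x_inf - (x_inf - x0) * rate**n, the common form of (21) and (22)."""
    return x_inf - (x_inf - x0) * rate ** n


def discrimination_curves(n_trials, a, b, a_prime, b_prime, p0, eta0, eta0_prime, k):
    """Return the lists p_n (equation 27) and p_n' (equation 28) for n = 0..n_trials."""
    g = 1.0 - a - b
    g_prime = 1.0 - a_prime - b_prime
    alpha_inf = a / (a + b)
    alpha_inf_prime = a_prime / (a_prime + b_prime)
    numerator = a_prime + a * g_prime
    beta_inf = numerator / (numerator + b_prime + b * g_prime)  # equation (23)
    f = g * g_prime
    p, p_prime = [], []
    for n in range(n_trials + 1):
        alpha = closed_form(p0, alpha_inf, g, n)
        alpha_prime = closed_form(p0, alpha_inf_prime, g_prime, n)
        beta = closed_form(p0, beta_inf, f, n)
        eta = k ** n * eta0                      # equation (26)
        eta_prime = k ** n * eta0_prime
        p.append(alpha * (1.0 - eta) + beta * eta)
        p_prime.append(alpha_prime * (1.0 - eta_prime) + beta * eta_prime)
    return p, p_prime


# Values from the caption of Fig. 4 (response rewarded in S, not in S').
p, p_prime = discrimination_curves(
    200, a=0.12, b=0.03, a_prime=0.0, b_prime=0.03,
    p0=0.05, eta0=0.50, eta0_prime=0.50, k=0.95,
)
```

With these values `p` rises steadily toward a/(a + b) = 0.8, while `p_prime` first rises above 0.05 and then falls back toward zero, which is the qualitative pattern described in the text.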

We have argued (1) that the mean 
latent time varies inversely as the 
probability. Thus in Fig. 5 we have plotted the reciprocals of $p_n$ and
$p_n'$ given in Fig. 4. These curves exhibit the same general property of
the experimental curves on running time of rats obtained by Raben (5).

(b) Complete conditioning in S before discrimination training. Another
special case of interest is that in which the set S is completely
conditioned to the response before the discrimination experiment is
performed. In this case, $\alpha_0 = \beta_0 = p_0 = 1$. In Fig. 6 we have
plotted $p_n$ and $1/p_n$ with these conditions and the values
$p_\infty = 1$, $\beta_\infty = 0$, $\eta_0 = 0.80$, k = 0.90, and
f = 0.50. The curve of 1/p versus n is similar in shape to the
experimental latency curve obtained by Solomon (6) from a jumping
experiment with rats.

FIG. 6. Curves of probability, p, and its reciprocal versus trial number,
n, for the case of complete conditioning in S before the discrimination
training. Equation (27) with the values $p_\infty = 1$, $\beta_\infty = 0$,
$\eta_0 = 0.80$, k = 0.90, and f = 0.50 were used.
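The shape of the curve in case (b) can be seen directly from equation (27). The reduction below is a sketch under one reading of the stated conditions, namely that $\alpha_0 = \alpha_\infty = 1$ (so that $\alpha_n = 1$ on every trial), $\beta_0 = 1$, and $\beta_\infty = 0$; this reading is an assumption on our part, not a formula given in the text.

```latex
% Equation (27) with alpha_n = 1, beta_0 = 1, beta_infty = 0 (assumed reading)
\begin{aligned}
p_n &= (1 - k^n \eta_0) + f^n k^n \eta_0 \\
    &= 1 - k^n \eta_0 \,(1 - f^n).
\end{aligned}
% With eta_0 = 0.80, k = 0.90, f = 0.50:
%   p_0 = 1,
%   p_1 = 1 - (0.9)(0.8)(0.5)     = 0.64,
%   p_2 = 1 - (0.81)(0.8)(0.75)   = 0.514,
%   p_3 = 1 - (0.729)(0.8)(0.875) \approx 0.490,
% after which p_n climbs back toward 1 as k^n -> 0.
```

Under these assumptions the probability dips early in training and then recovers, so that its reciprocal shows a hump.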

(c) Limiting case of S and S' identical. Another limiting case of the
kind of discrimination experiment being considered here obtains when we
make the two stimulus situations S and S' identical. The problem
degenerates into one type of partial reinforcement where, for example, an
animal is rewarded on every second trial in a fixed stimulus situation.
The intersection I of S and S' is of course identical to both S and S'.
Thus the measure of I must equal the measure of S. From equation (5), we
have

$$\eta = \frac{m(I)}{m(S)} = 1, \tag{31}$$

while according to our postulate about the operator D, equation (26), the
similarity index varies from trial to trial:

$$\eta_n = k^n \eta_0. \tag{32}$$

For S and S' identical, the above two equations are incompatible, unless
we take k = 1. Thus, we are forced to assume that k depends on how many
cues are available for discrimination in such a way that k = 1 when none
are available. Moreover, since I and S are identical, the measure of T,
the complement of I in S, must be zero. Since $T_c$ is a sub-set of T, the
measure of $T_c$ must also be zero. Therefore, equations (17) and (19)
give in place of equation (20)

$$p = \beta\eta. \tag{33}$$

But we have just argued that for S and S' identical, we have $\eta = 1$.
Thus

$$p = \beta. \tag{34}$$

Equation (22) gives us then

$$p_n = (Q'Q)^n p_0 = p_\infty - (p_\infty - p_0) f^n. \tag{35}$$

This equation agrees with our previous result on partial reinforcement.


(d) Irregular presentations of S and S'. In most experiments, S and S'
are not presented alternately, but in an irregular sequence so that the
animal cannot learn to discriminate on the basis of temporal order. A
simple generalization of the above analysis will handle the problem. The
usual procedure is to select a block of (j + j') trials during which S is
presented j times and S' presented j' times. The actual sequence is
determined by drawing “S balls” and “S' balls” at random from an urn
containing j “S balls” and j' “S' balls.” This sequence is then repeated
throughout training. In our model, we can describe the effects on the
probability of a known sequence by an appropriate application of our
operators Q, Q', and D for presentations of S, presentations of S', and
shifts from one to the other, respectively. A less cumbersome method
provides a reasonable approximation: for each block of (j + j') trials we
describe an effective or expected new value of probability by applying Q
to its operand j times, Q' to its operand j' times, and D to the index
$\eta$ a number of times determined by the mean number of shifts from S to
S'. For the special case of j = j' the mean number of shifts is j. Since
previously, we applied D to $\eta$ for each pair of shifts, we write for
the (i + 1)th block of (2j) trials


$$p_{i+1} = Q^j \alpha_i (1 - D^{j/2} \eta_i) + (Q'Q)^j \beta_i D^{j/2} \eta_i. \tag{36}$$


The rest of the analysis exactly parallels
that given above for the case of
alternate presentations of S and S'.
The results will be identical except
for the value of k involved in the
operator D.
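The statement that the mean number of shifts is j when j = j' can be checked by exact enumeration. The sketch below assumes that a "shift" means any change of situation between adjacent trials within one block (in either direction) and that every ordering of the balls is equally likely; the function names are ours.

```python
from fractions import Fraction
from itertools import combinations


def count_shifts(sequence):
    """Number of adjacent trials on which the situation changes."""
    return sum(1 for x, y in zip(sequence, sequence[1:]) if x != y)


def mean_shifts(j, j_prime):
    """Exact mean of count_shifts over all equally likely orderings of one block."""
    n = j + j_prime
    total = 0
    n_orderings = 0
    for s_positions in combinations(range(n), j):
        chosen = set(s_positions)
        sequence = ["S" if i in chosen else "S'" for i in range(n)]
        total += count_shifts(sequence)
        n_orderings += 1
    return Fraction(total, n_orderings)
```

Under these assumptions the enumeration agrees with the general value 2jj'/(j + j'), which reduces to j when j = j'.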

SUMMARY 


A mathematical model for stimulus 
generalization and discrimination is 
described in terms of simple set-theo- 
retic concepts. An index of similarity 
is defined in terms of the model but is
related to measurements in generalization
experiments. The mathematical
operators for acquisition and extinction,
discussed in an earlier paper (1),
are derived from the set-theoretic 
model presented here. The model is 
finally applied to the analysis of ex- 
periments on stimulus discrimination. 


[MS. received October 13, 1950]

TWO-CHOICE BEHAVIOR OF PARADISE FISH
ROBERT R. BUSH AND THURLOW R. WILSON


Harvard University 1


Our problem stems principally from 
two experiments. Brunswik (1) ob- 
served the acquisition of a position 
discrimination by rats when food was 
placed more frequently in one box. 
Research by Humphreys (9) was 
comparable in that S had two choices 
with partial reinforcement of both. 
He required college students to guess 
on every trial whether or not a light 
would flash, and then in accordance 
with a predetermined schedule, the 
light did or did not flash. The 
Humphreys study exemplifies a non- 
contingent procedure for two-choice 
learning since the flash of the light 
did not depend upon the choice made 
by S. Brunswik's rats faced a contingent situation since the
environmental change, presentation of food, was contingent in part on S's
response. A contingent two-choice research on humans has been performed
by Goodnow (2, pp. 294-296). Her Ss decided on every trial which of two
buttons to press. If the choice was correct, they earned a poker chip,
otherwise not. Human two-choice learning with partial reinforcement has
been further observed under contingent procedure (3) and under
noncontingent procedure (3, 4, 5, 6, 7, 8, 10).

Bush and Mosteller (2) suggest that these two types of procedures are
associated with different forms of asymptotic choice distribution (choice
distribution after learning) for the individual Ss. In general, most Ss
in a contingent experiment are found to have an asymptotic choice
distribution of 100% selection of the favorable alternative.
Noncontingent situations give rise to other kinds of choice
distributions; in such experiments, the asymptotic proportion of choices
of the favorable alternative has been observed to match the proportion of
reinforcements scheduled for the alternative.

1 This research was supported by the Laboratory of Social Relations,
Harvard University. We are indebted to W. S. Verplanck for suggesting
that we use fish in learning experiments and to F. Mosteller for numerous
suggestions and criticisms.

This article appeared in J. exp. Psychol., 1956, 51, 315-322.

We attempted to obtain the noncontingent results with nonhuman Ss. Red
paradise fish were confronted with a position discrimination with partial
reinforcement in which one side was correct a random 75% and the other
side correct for the remaining 25%. The apparatus was a discrimination
box with adjacent goal compartments. For the experimental Ss, E placed
the food in the correct compartment regardless of whether S had entered
the correct goal box; the divider between the two goal boxes was
transparent for the experimental group so that these Ss were able to see
the food in the correct compartment when they had chosen incorrectly. The
control group was run with an opaque divider separating the goal
compartments in order to get conditions comparable to those used by
Brunswik.


THEORY 


We attempt to describe the experimental data within the framework of the
stochastic model given by Bush and Mosteller (2). On trial n (where
n = 0, 1, 2, ...) there exists a probability $p_n$ that S will choose the
more favorable side. One of four events occurs on this trial and each
leads to a different value of $p_{n+1}$. As in similar analyses, we assume
that the effect of feeding is symmetrical for the two goal boxes; we make
a similar assumption for nonfeeding. In addition, we assume that a long
sequence of feedings on one side would tend to make the probability of
going there unity. These special assumptions reduce the general model to
the following statements about $p_{n+1}$:

Event                        $p_{n+1}$                                        Probability of occurrence
favorable side, food         $\alpha_1 p_n + (1 - \alpha_1)$                  $.75\,p_n$
favorable side, no food      $\alpha_2 p_n + (1 - \alpha_2)\lambda$           $.25\,p_n$
unfavorable side, food       $\alpha_1 p_n$                                   $.25(1 - p_n)$
unfavorable side, no food    $\alpha_2 p_n + (1 - \alpha_2)(1 - \lambda)$     $.75(1 - p_n)$

The model previously used for analyzing two-choice experiments using the
contingent procedure is obtained from the above table by imposing the
further restriction that $\alpha_2 = 1$. This assumption implies that
nonfeeding is an event which does not alter the response probabilities.
It was expected that this model would describe learning by the control
group in the present experiment. Given this specific model, it can be
shown that the asymptotic p for each S will be either 1.0 or 0; for the
75:25 schedule it is predicted that a high percentage of Ss will tend
towards 1.0. The exact percentage depends upon the value of $\alpha_1$.

We propose two specific models for the experimental group of the present
experiment. These models are obtained from the foregoing table by
imposing two different sets of additional restrictions which in turn are
suggested by two different theories of learning. The first specific
model, herein called the information model, is obtained by taking
$\alpha_1 = \alpha_2$ and $\lambda = 0$. As a result, the first and fourth
listed events in the foregoing table have the same effect on $p_n$; they
correspond to food being placed on the favorable side. Similarly, the
second and third listed events have the same effect; they correspond to
food being placed on the unfavorable side. These restrictions appear to
arise most readily from a cognitive learning point of view, because each
trial may be described as providing information about the payoff
schedules. This information model is equivalent to the models used by
Bush and Mosteller (2) and by Estes (4) for describing human experiments
with the non-contingent procedure.

2 Besides contingent and noncontingent procedure, other kinds of factors,
such as a gambling versus a problem-solving orientation, have been
related to asymptotic choice distribution (6). We shall not deal with
these other factors.

Reprinted with permission.
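The four-event table and the special cases derived from it can be written as a small function. This is a minimal sketch, assuming the table as reconstructed above (with $\alpha_1$, $\alpha_2$, $\lambda$ as its parameters and a 75:25 schedule); the names `next_p`, `event_probabilities`, and `expected_next_p` are ours, and the value 0.9 in the usage note is purely illustrative.

```python
EVENTS = (
    "favorable, food",
    "favorable, no food",
    "unfavorable, food",
    "unfavorable, no food",
)


def next_p(p, event, alpha1, alpha2, lam):
    """New probability of choosing the favorable side after one event."""
    if event == "favorable, food":
        return alpha1 * p + (1.0 - alpha1)
    if event == "favorable, no food":
        return alpha2 * p + (1.0 - alpha2) * lam
    if event == "unfavorable, food":
        return alpha1 * p
    if event == "unfavorable, no food":
        return alpha2 * p + (1.0 - alpha2) * (1.0 - lam)
    raise ValueError(f"unknown event: {event}")


def event_probabilities(p, pi=0.75):
    """Probability of each event when the favorable side is correct with probability pi."""
    return {
        "favorable, food": pi * p,
        "favorable, no food": (1.0 - pi) * p,
        "unfavorable, food": (1.0 - pi) * (1.0 - p),
        "unfavorable, no food": pi * (1.0 - p),
    }


def expected_next_p(p, alpha1, alpha2, lam, pi=0.75):
    """Mean of p on the next trial, averaged over the four events."""
    probs = event_probabilities(p, pi)
    return sum(probs[e] * next_p(p, e, alpha1, alpha2, lam) for e in EVENTS)
```

With the information-model restrictions (equal alphas, lam = 0), `expected_next_p(0.75, 0.9, 0.9, 0.0)` returns 0.75 up to rounding, so the expected probability is stationary at the reinforcement proportion. With lam = 1, the values p = 0 and p = 1 are left unchanged by every event that can occur there.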

The other specific model for the ex- 
perimental group, herein called the 
secondary reinforcement model, is obtained 
from the additional restrictions, $\lambda = 1$
and $\alpha_2 > \alpha_1$. This model assumes that
when S enters one goal box and sees food
in the other goal box it is secondarily
reinforced for the response just made.
It has been shown that this model predicts
that each S will have an asymptotic
p of 1 or 0 and that more Ss will tend
towards 1 than 0. The precise proportion
that tend towards 1 depends on
the values of $\alpha_1$ and $\alpha_2$.

We are chiefly concerned with predictions about the forms of the
asymptotic distributions of choices of the favorable side. These
predictions could be tested experimentally by running many trials in the
experiment and obtaining a proportion of choices for each S during, say,
the last 100 trials. The proportions thus obtained would form a
distribution which could be compared with the predicted ones.
Unfortunately, the mathematical analysis presented by Karlin (11)
suggests that the convergence of the distributions of these models is
very slow. Therefore, a great many trials would be required in the
experiment to obtain the desired distribution. In view of these
considerations, we are forced to examine the “near-asymptotic”
distributions. The information model predicts that such a distribution
will be clustered around a point just below .75, whereas the secondary
reinforcement model predicts that it will be U-shaped with a peak near 1
and a somewhat smaller peak near 0. The model for the control group
($\alpha_2 = 1$) also predicts a U-shaped near-asymptotic distribution,
but the peak near 0 should be very small compared to that for the
secondary reinforcement model. These predictions are compared with data
below.

METHOD 


Subjects.— The Ss were 49 red paradise fish, 27 
in the control group and 22 in the experimental 
group. The red paradise fish (Macropodus 
opercularis) is a hardy tropical fish about 2 in. 
in length selected because of its small demands 
for care. The Ss were housed separately in
tanks with a water temperature of 80° ± 1° F.
This was the temperature indicated by our
feeding studies for maximum appetite. Lighting
was by fluorescent fixtures which were auto- 
matically turned on for a standard 12-hr. period 
each day to control the activity cycle. (This 
fish has a diurnal rhythm of activity.) 

Apparatus.—The apparatus was a discrimi- 
nation box as shown in Fig. 1. The maze was 
constructed of f-in. opaque white Plexiglas, 
except for parts of the goal boxes. The control 
group had a white opaque divider, whereas for 
the experimental group this divider was trans- 
parent. For one goal box the side opposite the 
entrance to the box was formed from a piece of 
opaque light yellow plastic; the corresponding 
side of the other box was white opaque. These 
sides could be interchanged. (Exploratory 
studies indicated that a position discrimination 
with identical goal boxes is learned very slowly 
by these fish.) 

The apparatus was placed in a 10-gal. tank 
shielded from room lights. Lighting came 
largely from a 75-w. spotlight 2 ft. above the 
maze and focused on the start chamber. Care 
was taken to ensure that water conditions of this 
experimental tank were as close as possible to 
those of the home tanks of Ss. 

Feeding.— The experimental food was pre- 
pared fish eggs from an inexpensive (10 cents an 
ounce) caviar (“Lumpfish caviar” packed by 
Hansen Caviar Co., New York, N. Y.). These 
eggs were found to be a highly preferred food of 
the paradise fish and were convenient to obtain 
and store. The eggs were presented singly;
the egg was held on the end of a medicine dropper
by suction (the egg was 1 mm. to 2 mm. in
diameter and larger than the opening of the
dropper). To secure the egg, the fish was
obliged to pull it from the dropper. A fish was
required to earn all of its food by solving the
discrimination problem.

[FIG. 1. ... apparatus.]

Pretraining.—The pretraining took two or three days. On the first day
the fish was fed eggs (10 or 20) by eye dropper in its home tank. For the
next one or two days the fish underwent forced trials (10 or 20) in the
maze. Half of the forced trials were to the right-side goal box. About
one-third of the fish were rejected from the experiment at the end of
pretraining or after one or two days of discrimination training, leaving
49 Ss. (Fish were rejected because they would not eat in the apparatus or
because E made an error in procedure.)

Procedure.—All Ss received a total of 140 trials, 20 trials a day or
less. One goal box (the favorable side) was scheduled for reinforcement
on 75% of the trials while the other goal box was scheduled for
reinforcement on the remaining trials. On a given trial only one goal box
was correct. The trials for which the favorable side was incorrect were
selected by restricted randomization within blocks of 20. The restriction
was that runs of incorrect trials could not be longer than two. All fish
had the same schedule.
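A schedule of this kind can be generated by rejection sampling. The sketch below is an illustration only: it assumes that each block of 20 contains exactly 5 trials on which the favorable side is incorrect (25% of the block) and that the run restriction applies within a block; neither detail is stated explicitly in the text, and the function names and seed are hypothetical.

```python
import random


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def make_block(rng, block_size=20, n_incorrect=5, max_run=2):
    """Shuffle one block until no run of 'favorable side incorrect' exceeds max_run."""
    block = [True] * n_incorrect + [False] * (block_size - n_incorrect)
    while True:
        rng.shuffle(block)
        if longest_run(block) <= max_run:
            return list(block)


def make_schedule(seed=0, n_blocks=7):
    """Seven blocks of 20 give the 140 trials; True marks 'favorable side incorrect'."""
    rng = random.Random(seed)
    return [make_block(rng) for _ in range(n_blocks)]
```

Because all fish had the same schedule, a single call such as `make_schedule(seed=1)` would be generated once and reused for every subject.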

The right, yellow side was favorable for one-fourth of the Ss; right,
white for one-fourth; left and yellow for one-fourth, and left and white
for one-fourth.

The procedure for the control group was as follows. The fish was released
from the start chamber, and it swam down to the goal boxes. If the fish
poked its nose into the goal box which was correct for that trial, E
lowered a medicine dropper with a fish egg into the compartment (the
dropper was secured to an arm) and allowed the fish to feed. If the fish
entered the incorrect goal box, no food was placed in the goal box. In
either case, the fish was chased back into the start chamber after 3-4
sec. in the goal box. This was accomplished with a piece of plastic of
width slightly less than the width of the maze; the fish quickly
developed avoidance tendencies to this “paddle.” As soon as it was
lowered into the tank, the fish promptly returned to the start chamber.
The interval between trials was 12 sec. No retracing was permitted.
Except for the transparent rather than opaque piece dividing the two goal
boxes, the procedure for the experimental group differed only in one
detail: after the fish had entered a compartment E placed the medicine
dropper with a fish egg in the correct goal box. If the fish had entered
the correct goal box, it secured the egg. Otherwise the fish could see
the egg through the transparent divider but could not obtain it.
Observations indicated that they did in fact see the egg on most of these
trials.


RESULTS AND DISCUSSION


Initial preferences.—Position and color preferences may strongly
influence the results of a discrimination study. For this reason the
balanced design described in the preceding section was used. This
technique, however, tends to eliminate a group preference only. From an
analysis of variance of the responses on the first 10 trials, we
concluded that there were no group color or position preferences but that
there were individual preferences. The stochastic models used in
analyzing the data are sensitive to the entire distribution of initial
probabilities, not only its mean. Therefore, it is necessary to consider
the actual distribution.

TABLE 1

OBSERVED DISTRIBUTION OF CHOICES OF THE FAVORABLE SIDE DURING THE FIRST
TEN TRIALS FOR THE TWO GROUPS OF FISH COMBINED, AND THE THEORETICAL
DISTRIBUTIONS FOR THE BINOMIAL MODEL (p = .5) AND FOR THE SYMMETRIC BETA
DISTRIBUTION WITH s = .7

Number of                   Predicted
choices      Observed    Binomial    Beta

    0            2          0.05     2.34
    1            3          0.48     3.71
    2            5          2.16     4.66
    3            7          5.75     5.28
    4            4         10.06     5.64
    5            8         12.07     5.76
    6            8         10.06     5.64
    7            1          5.75     5.28
    8            5          2.16     4.66
    9            2          0.48     3.71
   10            4          0.05     2.34

Total           49         49.07    49.02

One binomial observation for each 
initial probability is insufficient to 
determine anything about the initial 
distribution except the mean. Thus 
we must look at the number of suc- 
cesses (choices of the favorable side) 
by each fish during the first several 
trials and assume that the probability 
for each fish does not appreciably 
change during these trials. For this 
purpose the two groups of Ss were 
combined, giving an N of 49, and the 
first 10 trials of the data were used. 
In Table 1 we show the frequencies 
of choice observed as well as those 
predicted by two models which are 
now briefly discussed. 

The mean number of observed successes during the first 10 trials is 4.96
and so the balanced design accomplished its purpose. But, if we assume
that each of the 49 fish had a binomial probability of .5, the predicted
frequencies of choices are those shown in the third column of Table 1.
The discrepancies are highly significant. The likelihood ratio test (12,
p. 257) (this is essentially the chi-square test) leads to P < .005.
Therefore we consider an alternative assumption: that the initial
distribution is a symmetric beta distribution (12, p. 115) with a mean of
.5. It may be written in the form

$$f(p) = C[p(1 - p)]^s,$$

where C is a constant chosen so that the total density is unity, and
where s is a parameter which determines the spread of the distribution.
The method of maximum likelihood (12, pp. 152-160) was used to estimate s
from the data, giving .7 as the estimate. The distribution of successes
during 10 trials can then be computed. The results are shown in the last
column of Table 1 and the likelihood ratio test gives P = .4. This fit
was considered satisfactory.

FIG. 2. Learning curve for each of the two groups of fish and for the 22
stat-fish which parallel the experimental group. Mean proportion of
choices of the favorable side is plotted for each block of 10 trials.
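The two predicted columns of Table 1 can be recomputed from the stated assumptions. The sketch below assumes that, for a fish whose initial probability is drawn from the density $C[p(1 - p)]^s$, the number of successes in 10 trials follows the beta-binomial distribution with both shape parameters equal to s + 1; the function names are ours.

```python
from math import comb, exp, lgamma

N_FISH = 49
N_TRIALS = 10


def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def binomial_frequencies(p=0.5, n=N_TRIALS, n_fish=N_FISH):
    """Expected number of fish with k successes if every fish has probability p."""
    return [n_fish * comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def beta_frequencies(s=0.7, n=N_TRIALS, n_fish=N_FISH):
    """Expected frequencies when p has density C[p(1 - p)]**s, i.e. Beta(s + 1, s + 1)."""
    shape = s + 1.0
    return [
        n_fish * comb(n, k) * exp(log_beta(k + shape, n - k + shape) - log_beta(shape, shape))
        for k in range(n + 1)
    ]
```

Under these assumptions `beta_frequencies()` agrees with the Beta column of Table 1 to within about 0.01 per cell, and `binomial_frequencies()` agrees with the Binomial column to a similar tolerance.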

Learning curves.—In Fig. 2 we show the proportion of successes in blocks
of 10 trials for each of the two groups of fish. It is clear that the
control group learned more rapidly than the experimental group, but
little more can be inferred from this figure. One can conjecture, of
course, that the sight of food in the opposite goal box when food was not
obtained slowed down the learning process. Just how this comes about can
be determined only by a more detailed analysis of the data.

We hasten to note at this point that the models described above do not
predict the relative rates of learning of groups of Ss run under
different experimental conditions. Within the framework of the models,
rates of learning are determined by the values of parameters which must
be estimated from data. The models do predict, however, other properties
of the data considered in the following sections.

The near-asymptotic distributions.—The two specific models for the
experimental group (the information model and the secondary reinforcement
model) make very different predictions about the shape of the
distributions of successes after learning is nearly complete. In the
second column of Table 2 we show the frequencies of successes during the
last 49 trials (the number of successes varies from 0 through 49). The
observed U-shaped near-asymptotic distribution is not determined by
initial preferences alone; the rank-order correlation coefficient between
the number of favorable choices on the first and last 10 trials is [value
illegible]. The information model predicts a clustering around 37 but
this prediction is clearly not confirmed by the experimental group data.
The secondary reinforcement model, on the other hand, predicts a U-shaped
distribution with greater density at the high end than at the low end.
This prediction is confirmed. On this basis alone we can choose the
secondary reinforcement model in preference to the information model.
Detailed questions of goodness of fit are considered in the following
sections.

TABLE 2

DISTRIBUTION OF SUCCESSES (CHOICES OF THE FAVORABLE SIDE) DURING THE LAST
49 TRIALS FOR THE TWO GROUPS OF FISH AND FOR THE 22 STAT-FISH WHICH
PARALLEL THE EXPERIMENTAL GROUP OF REAL FISH

(Row labels and the cells marked ? are illegible.)

Number of    Experimental                Control
successes    Group          Stat-Fish    Group

    ?            4              ?           ?
    ?            ?              ?           0
    ?            2              0           0
    ?            0              0           1
    ?            ?              0           2
    ?            0              1           ?
    ?            ?              0           2
    ?            2              2           5
    ?            2              ?           7
    ?           10             10           1

Total           22             22          27

The model proposed for the control group involves the assumption that
nonreward has no effect ($\alpha_2 = 1$) and it predicts that the
near-asymptotic distribution of successes will also be U-shaped but with
very small density at the low end. This indeed agrees with the data shown
in the last column of Table 2: one out of 27 fish stabilized at the
unfavorable side (it chose that side 46 times during the last 49 trials).
The other 26 fish either stabilized on the favorable side or did not yet
stabilize during the trials run. In the next section we consider the
basic assumption that $\alpha_2 = 1$ made in the model for the control
group.

Parameter estimates.—Having chosen the secondary reinforcement model for
the experimental group, we need to estimate the primary reward parameter,
$\alpha_1$, and the secondary reward parameter, $\alpha_2$. These
estimates are required for two reasons: (a) we wish to measure the
relative effects of primary and secondary reinforcement in this
experiment (the smaller the value of $\alpha$, the greater the effect),
and (b) the estimates are used in measuring goodness of fit of the model
to the data in a detailed way. For the control group, we assume that the
same model applies and then estimate both parameters and determine
whether or not the assumption that $\alpha_2 = 1$ is tenable.

TABLE 3

ESTIMATES OF THE TWO PARAMETERS OBTAINED FOR EACH OF THE TWO GROUPS OF
FISH

                                  Experimental    Control
Parameters                        Group           Group

Primary reward, $\alpha_1$        0.916           0.956
Secondary reward, $\alpha_2$      0.942           0.986

The procedure used to estimate the 
two reward parameters cannot be de- 
scribed in detail here. It uses the first 
three moments of the observed distri- 
butions of successes in each block of 10 
trials; these are used in conjunction with 
formulas for moments of the p-value 
distributions derived by Bush and 
Mosteller (2, p. 98). The results, how- 
ever, are shown in Table 3. It can be 
noted that the secondary reward pa- 
rameter, 2, is larger for both groups 
than the corresponding primary reward 
parameter, a1. This confirms the ex- 
pectation that primary reward is more 
effective. (A small value of & implies a 
more effective event than does a large 
value.) For the control group, the 
value of az is near 1.0 as assumed in the 
model for the control group, but the 
fact that it is not quite 1.0 suggests that 
nonreward is slighty reinforcing even 
for the control group. The result that 
a1 is less for the experimental group than 
for the control group (primary reward 
more effective) is not predicted by any 
of the models. 

The relative effects of primary and 
secondary reward for each group can be 
estimated as follows. We note that 
(916): = .942 and this means that 
secondary reward is about 60% as 
effective as primary reward for the 
experimental group. Similarly, (.956)^{.3}
= .986, and so secondary reward is about
30% as effective for the control group.
These percentages may be in error appreciably
because of the sampling errors
in the parameter estimates, but they do
indicate roughly the effects.
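
The exponents can be checked from the rounded entries of Table 3. The sketch below is an illustration added here, not the authors' estimation procedure, and the function name `relative_effectiveness` is hypothetical. It solves α_1^k = α_2 for k, on the assumption that one secondary reward then counts as k primary rewards.

```python
import math

def relative_effectiveness(alpha_primary, alpha_secondary):
    """Return k such that alpha_primary ** k == alpha_secondary."""
    return math.log(alpha_secondary) / math.log(alpha_primary)

# Rounded parameter estimates as printed in Table 3.
experimental = relative_effectiveness(0.916, 0.942)
control = relative_effectiveness(0.956, 0.986)

print(f"experimental group: k = {experimental:.2f}")
print(f"control group:      k = {control:.2f}")
```

With these rounded inputs the exponents come out near 0.68 and 0.31, in rough agreement with the percentages quoted above.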

Stat-fish.—A convenient way of com- 
paring model predictions with data is to 
run Monte Carlo computations or “stat- 
fish” as described elsewhere (2, pp. 129- 
131, 251-252). One hundred runs of 
140 trials each were carried out on IBM 
machines³ using the parameter values
given in Table 3 for the experimental 
group. From these 100 runs, a stratified 
sample of 22 runs was drawn such that 
the initial distribution of probabilities 
would approximate the symmetric beta 
distribution with the parameter s = .7. 
These 22 stat-fish can then be compared 
directly with the 22 paradise fish in the 
experimental group. 
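
A stat-fish run of this kind might be sketched as follows. This is a minimal illustration under stated assumptions, not the original IBM computation: it assumes linear operators in which a primary reward (food found on the chosen side) applies α_1 and a secondary reward (food seen on the other side) applies α_2, both strengthening the response just made; it assumes food is on the favorable side with probability .75; and it reads the "symmetric beta distribution with the parameter s = .7" as Beta(.7, .7). The stratified selection of 22 runs out of 100 is not reproduced, and all names are hypothetical.

```python
import random

ALPHA_PRIMARY = 0.916     # Table 3, experimental group
ALPHA_SECONDARY = 0.942   # Table 3, experimental group
P_FOOD_FAVORABLE = 0.75   # assumed reward schedule
N_TRIALS = 140
N_FISH = 22

def run_stat_fish(p_initial, rng):
    """Simulate one stat-fish; return its number of favorable-side choices."""
    p = p_initial  # probability of choosing the favorable side
    successes = 0
    for _ in range(N_TRIALS):
        chose_favorable = rng.random() < p
        food_on_favorable = rng.random() < P_FOOD_FAVORABLE
        rewarded = chose_favorable == food_on_favorable
        alpha = ALPHA_PRIMARY if rewarded else ALPHA_SECONDARY
        if chose_favorable:
            p = alpha * p + (1.0 - alpha)  # response just made is strengthened
            successes += 1
        else:
            p = alpha * p                  # the other response is strengthened
    return successes

rng = random.Random(0)  # fixed seed so the sketch is repeatable
initial_ps = [rng.betavariate(0.7, 0.7) for _ in range(N_FISH)]
successes = [run_stat_fish(p0, rng) for p0 in initial_ps]
print(sum(successes) / N_FISH)
```

Because every outcome strengthens the response just made, each simulated fish drifts toward exclusive choice of one side, which is the qualitative behavior the secondary reinforcement model predicts.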

The “learning curve” of the stat-fish 
is shown in Fig. 2 along with those of the 
real fish. It can be seen that the stat- 
fish curve is slightly above the curve for 
the experimental group. This should 
not be interpreted as a discrepancy 
between the model and the data. Rather 
it is some indication of how well the 
model parameters were estimated from 
the data. Loosely speaking, the esti- 
mates were obtained by requiring that 
the learning rates of the model popu- 
lation and of the experimental sample 
be equal. To measure goodness of fit 
we must look at other properties of the 
data. 

The near-asymptotic distribution of 
successes of the 22 stat-fish was obtained 
in the same manner as for the real fish. 
The results are shown in the third column 
of Table 2 and are sufficiently close to the 
corresponding frequencies of the experi- 
mental group that we consider formal 
tests for goodness of fit would be super- 
fluous. 

Many sequential properties of the data 
can be compared to the corresponding 
properties of the stat-fish “data” in
order to obtain further measures of 
goodness of fit. Thus we have tabulated 
the distribution of runs (of successes and 
failures) for the experimental group and 
for the stat-fish. In Table 4 we show 
the mean and SD of the total number of 
runs, of the number of runs of various 
lengths, as well as the number of successes
per S. It can be seen that all
but one of the tabulated means are 
slightly smaller for the real fish than for 
the stat-fish, and that the variability of 
these measures is less for the real fish. 


3 We are indebted to B. P. Cohen and P. D. 
Seymour for making these computations. 
TABLE 4

COMPARISON OF STATISTICS COMPUTED FROM THE
DATA FOR THE EXPERIMENTAL GROUP OF
22 FISH AND FROM THE SEQUENCES
OBTAINED FROM THE 22 STAT-FISH

                     Experimental Group | Stat-Fish
Statistic            Mean | SD | Mean | SD
Total number runs  | 27.3 20.2
Runs of length 1     12.9 10.1
Runs of length 2     4.4 6.3
Runs of length 3     2.0 2.6
Runs of length 4     18 2.2
Runs of length 5     A 14
Number successes   | 81.3 | 48.0 | 87.6 | 48.2


All these discrepancies are a result of 
the fact that two of the stat-fish never 
chose the unfavorable side and two 
others chose it only once each. These 
four stat-fish had initial success proba- 
bilities of .95, .95, .85, and .85, respec- 
tively. The smallest number of failures 
by the real fish is five. This suggests 
that better agreement would have been 
found if the initial distribution of proba- 
bilities had had less density in the 
extremes; the symmetric beta distri- 
bution was used only as an approxi- 
mation to the true initial distribution. 
Furthermore, learning during the first 
10 trials tends to spread out the distri- 
bution of response probabilities and so 
the true initial distribution probably had 
less variance than the symmetric beta 
distribution used in the stat-fish com- 
putations. 

The distributions of the statistics 
given in Table 4 for the real fish and 
stat-fish can be compared in the same 
manner as used to compare two groups 
of Ss. The distributions are not normal 
and so we used the Mann-Whitney test 
(13). Comparison of each of the seven 
statistics listed in Table 4 led to P 
values greater than .3. Thus, we conclude
that the model adequately describes
much of the fine-grain character
of the data.
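
The statistic behind such a comparison can be sketched in a few lines. The code below is an illustration only: the data are hypothetical, not the paper's, and the function computes just the Mann-Whitney U counts; turning U into a P value still requires tables or a normal approximation, which are omitted here.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b); a tie between two observations counts one half."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b

# Hypothetical total-run counts for four real fish and four stat-fish.
real_fish = [25, 31, 22, 28]
stat_fish = [19, 27, 30, 18]
u_real, u_stat = mann_whitney_u(real_fish, stat_fish)
print(u_real, u_stat)
```

The test uses only the ordering of the observations, which is why it suits statistics whose distributions are not normal.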
SUMMARY


A two-choice experiment designed to provide 
Ss with complete information about the out- 
comes of each choice on each trial is described. 
The Ss were 49 red paradise fish divided into 
two groups; the control Ss were run with the 
conventional procedure whereas the experi- 
mental Ss were given an opportunity to observe 
the presence or absence of food on both sides of 
the maze. Both groups were rewarded on one 
side 75% of the time and on the other side the 
remaining 25% of the time. 

Two stochastic models for predicting the behavior of the experimental
group are discussed. The “information model” assumes an increment in
the probability of a fish choosing on a particular trial the side on
which food was placed on the preceding trial. This model predicts that
the distribution of choices approaches about .75 for all fish. The
“secondary reinforcement model,” on the other hand, assumes that sight
of food in the opposite goal box reinforces the response just made and
predicts that individual fish will approach 100% choice of one side or
the other. The data obtained support the secondary reinforcement model.
Parameters which measure the effectiveness of primary and secondary
reward are estimated from the data and then detailed comparisons
between model predictions and experimental results are made. It is
concluded that the secondary reinforcement model adequately describes
much of the fine-grain structure of the data.
TOWARD A STATISTICAL THEORY OF LEARNING *


BY WILLIAM K. ESTES 


Indiana University 


Improved experimental techniques for the study of conditioning and
simple discrimination learning enable the present-day investigator to
obtain data which are sufficiently orderly and reproducible to support
exact quantitative predictions of behavior. Analogy with other sciences
suggests that full utilization of these techniques in the analysis of
learning processes will depend to some extent upon a comparable
refinement of theoretical concepts and methods. The necessary interplay
between theory and experiment has been hindered, however, by the fact
that none of the many current theories of learning commands general
agreement among researchers. It seems likely that progress toward a
common frame of reference will be slow so long as most theories are
built around verbally defined hypothetical constructs which are not
susceptible to unequivocal verification. While awaiting resolution of
the many apparent disparities among competing theories, it may be
advantageous to systematize well established empirical relationships at
a peripheral, statistical level of analysis. The possibility of
agreement on a theoretical framework, at least in certain intensively
studied areas, may be maximized by defining concepts in terms of
experimentally manipulable variables, and developing the consequences
of assumptions by strict mathematical reasoning.


This essay will introduce a series of studies developing a statistical
theory of elementary learning processes. From the definitions and
assumptions which appear necessary for this kind of formulation, we
shall attempt to derive relations among commonly used measures of
behavior and quantitative expressions describing various simple
learning phenomena.

* For continual reinforcement of his efforts at theory construction, as
well as for many specific criticisms and suggestions, the writer is
indebted to his colleagues at Indiana University, especially Cletus J.
Burke, Douglas G. Ellson, Norman Guttman, and William S. Verplanck.

This article appeared in Psychol. Rev., 1950, 57, 94-107. Reprinted
with permission.


PRELIMINARY CONSIDERATIONS


Since propositions concerning psychological events are verifiable only
to the extent that they are reducible to predictions of behavior under
specified environmental conditions, it appears likely that greatest
economy and consistency in theoretical structure will result from the
statement of all fundamental laws in the form

R = f(S),

where R and S represent behavioral and environmental variables
respectively. Response-inferred laws, as for example those of
differential psychology, should be derivable from relationships of this
form. The reasoning underlying this position has been developed in a
recent paper by Spence (8). Although developed within this general
framework, the present formulation departs to some extent from
traditional definitions of S and R variables.

Many apparent differences among contemporary learning theories seem to
be due in part to an oversimplified definition of stimulus and
response. The view of stimulus and response as elementary, reproducible
units has always had considerable appeal because of its simplicity.
This simplicity is deceptive, however, since it entails the postulation
of various hypothetical processes to account for observed variability
in behavior. In the present formulation, we shall follow the
alternative approach of including the notion of variability in the
definitions of stimulus and response, and investigating the theoretical
consequences of these definitions.

It will also be necessary to modify 
the traditional practice of stating laws 
of learning in terms of relations between 
isolated stimuli and responses. At- 
tempts at a quantitative description of 
learning and extinction of operant behav- 
ior have led the writer to believe that 
a self-consistent theory based upon the 
classical S-R model may be difficult, if 
not impossible, to extend over any very 
wide range of learning phenomena with- 
out the continual addition of ad hoc 
hypotheses to handle every new situ- 
ation. A recurrent difficulty might be 
described as follows. In most formula- 
tions of simple learning, the organism 
is said originally to “do nothing” in 
the presence of some stimulus; during 
learning, the organism comes to make 
some predesignated response in the pres- 
ence of the stimulus; then during ex- 
tinction, the response gradually gives 
way to a state of “not responding” 
again. But this type of formulation 
does not define a closed or conservative 
system in any sense. In order to derive 
properties of conditioning and extinc- 
tion from the same set of general laws, 
it is necessary to assign specific proper- 
ties to the state of not responding 
which is the alternative to occurrence 
of the designated response. One solu- 
tion is to assign properties as needed 
by special hypotheses, as has been done, 
for example, in the Pavlovian concep- 
tion of inhibition. In the interest of 
simplicity of theoretical structure, we
shall avoid this procedure so far as
possible.

The role of competing reactions has been emphasized by some writers,
but usually neglected in formal theorizing. The point of view to be
developed here will adopt as a standard conceptual model a closed
system of behavioral and environmental variables. In any specific
behavior-system, the environmental component may include either the
entire population of stimuli available in the situation or some
specified portion of that population. The behavioral
component will consist in mutually ex- 
clusive classes of responses, defined in 
terms of objective criteria; these classes 
will be exhaustive in the sense that they 
will include all behaviors which may 
be evoked by that stimulus situation. 
Given the initial probabilities of the 
various responses available to an organ- 
ism in a given situation, we shall expect 
the laws of the theory to enable predic- 
tions of changes in those probabilities 
as a function of changes in values of 
independent variables. 


DEFINITIONS AND ASSUMPTIONS 


1. R-variables. It will be assumed 
that any movement or sequence of 
movements may be analyzed out of an 
organism’s repertory of behavior and 
treated as a “response,” various prop- 
erties of which can be treated as de- 
pendent variables subject to all the laws 
of the theory. (Hereafter we shall ab- 
breviate the word response as R, with 
appropriate subscripts where neces- 
sary.) In order to avoid a common 
source of confusion, it will be necessary 
to make a clear distinction between the 
terms R-class and R-occurrence. 

The term R-class will always refer 
to a class of behaviors which produce 
environmental effects within a specified 
range of values. This definition is not 
without objection (cf. 4) but has the 
advantage of following the actual prac- 
tice of most experimenters. It may be 
possible eventually to coordinate R- 
classes defined in terms of environ- 
mental effects with R-classes defined in 
terms of effector activities. 
By R-occurrence we shall mean a particular, unrepeatable behavioral
event. All occurrences which meet the defining criteria of an R-class
are counted as instances of that class, and as such are experimentally
interchangeable. In fact, various instances of an R-class are
ordinarily indistinguishable in the record of an experiment even though
they may actually vary with respect to properties which are not picked
up by the recording mechanism.

Indices of tendency to respond, e.g., probability as defined below,
always refer to R-classes.

These distinctions may be clarified 
by an illustration. In the Skinner-type 
conditioning apparatus, bar-pressing is 
usually treated as an R-class. Any 
movement of the organism which re- 
sults in sufficient depression of the bar 
to actuate the recording mechanism is 
counted as an instance of the class.
The R-class may be subdivided into 
finer classes by the same kind of cri- 
teria. We could, if desired, treat de- 
pression of a bar by the rat’s right 
forepaw and depression of the bar by 
the left forepaw as instances of two 
different classes provided that we have
a recording mechanism which will be 
affected differently by the two kinds of 
movements and mediate different rela- 
tions to stimulus input (as for example 
the presentation of discriminative stim- 
uli or reinforcing stimuli). If proba- 
bility is increased by reinforcement, 
then reinforcement of a right-forepaw- 
bar-depression will increase the proba- 
bility that instances of that subclass 
will occur, and will also increase the 
probability that instances of the broader
class, bar-pressing, will occur. 

2. S-variables. For analytic purposes it is assumed that all behavior
is conditional upon appropriate stimulation. It is not implied,
however, that responses can be predicted only when eliciting stimuli
can be identified. According to the present point of view, laws of
learning enable predictions of changes in probability of response as a
function of time under given environmental conditions.

A stimulus, or stimulating situation, 
will be regarded as a finite population 
of relatively small, independent, en- 
vironmental events, of which only a 
sample is effective at any given time. 
In the following sections we shall desig- 
nate the total number of elements as- 
sociated with a given source of stimula- 
tion as S (with appropriate subscripts 
where more than one source of stimu- 
lation must be considered in an experi- 
ment), and the number of elements ef- 
fective at any given time as s. It is 
assumed that when experimental condi- 
tions involve the repeated stimulation 
of an organism by the “same stimulus,” 
that is by successive samples of elements from an S-population, each
sample may be treated as an independent random sample from S. It is to
be expected that sample size will fluctuate somewhat from one moment to
the next, in which case s will be treated as the average number of
elements per sample over a given period.

In applying the theory, any portion 
of the environment to which the organism is exposed under uniform
conditions may be considered an S-population. The number of different
S’s said to be present in a situation will depend upon the number of
independent experimental operations, and the degree of
specificity with which predictions of 
behavior are to be made. If the experi- 
menter attempts to hold the stimulating 
situation constant during the course of 
an experiment, then the entire situa- 
tion will be treated as a single S. If 
in a conditioning experiment, a light 
and shock are to be independently ma- 
nipulated as the CS and US, then each 
of these sources of stimulation will be
treated as a separate S-population, and
so on.

It should be emphasized that the division of environment and behavior
into elements is merely an analytic device adopted to enable the
application of the finite-frequency theory of probability to behavioral
phenomena. In applying the theory to learning experiments we shall
expect to evaluate the ratio s/S for any specific situation from
experimental evidence, but for the present at least no operational
meaning can be given to a numerical value for either S or s taken
separately.

3. Probability of response. Proba- 
bility will be operationally defined as 
the average frequency of occurrence of 
instances of an R-class relative to the 
maximum possible frequency, under a 
specified set of experimental conditions, 
over a period of time during which the 
conditions remain constant. In accordance with customary usage the term
probability, although defined as a relative frequency, will also be
used to express the likelihood that a response will occur at a given
time.

4. Conditional relation. This relation may obtain between an R-class
and any number of the elements in an S-population, and has the
following implications.

(a) If a set of x elements from an S are conditioned to (i.e., have the
conditional relation to) some R-class, R_1, at a given time, the
probability that the next response to occur will be an instance of R_1
is x/S.

(b) If at a given time in an S-population, x_1 elements are conditioned
to some R-class, R_1, and x_2 elements are conditioned to another
class, R_2, then x_1 and x_2 have no common elements.

(c) If all behaviors which may be evoked from an organism in a given
situation have been categorized into mutually exclusive classes, then
the probabilities attaching to the various classes must sum to unity at
all times. We consider the organism to be always “doing something.” If
any arbitrarily defined class of activities may be selected as the
dependent variable of a
activity of the organism at any time 
must be considered as subject to the 
same laws as the class under considera- 
tion. Any increase in probability of 
one R-class during learning will, then, 
necessarily involve the reduction in 
probability of other classes; similarly, while the probability of one R
decreases during extinction, the probabilities of others must increase.
In other
words, learning and unlearning will be 
considered as transfers of probability 
relations between R-classes. 

5. Conditioning. It is assumed that on each occurrence of a response,
R_1, all new elements (i.e., elements not already conditioned to R_1)
in the momentarily effective sample of stimulus elements, s, become
conditioned to R_1.

An important implication of these 
definitions is that the conditioning of a 
stimulus element to one R automatically 
involves the breaking of any pre-existing 
conditional relations with other R’s. 
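
The bookkeeping implied by definitions 4 and 5 can be sketched as a small data structure. This is a hypothetical illustration, not part of the original formulation: the class name, method names, and the labels "R1" and "R2" are assumptions made for the example. Each stimulus element is conditioned to exactly one R-class, so conditioning an element to one class automatically removes it from any other, and the class probabilities always sum to unity.

```python
from fractions import Fraction

class StimulusPopulation:
    """A population of S elements, each conditioned to exactly one R-class."""

    def __init__(self, size, initial_class):
        self.conditioned_to = [initial_class] * size

    def probability(self, r_class):
        """Probability of r_class: conditioned elements x divided by S."""
        x = sum(1 for c in self.conditioned_to if c == r_class)
        return Fraction(x, len(self.conditioned_to))

    def condition(self, sample_indices, r_class):
        """Condition every element of the effective sample to r_class."""
        for i in sample_indices:
            self.conditioned_to[i] = r_class  # replaces any earlier relation

population = StimulusPopulation(size=10, initial_class="R2")
population.condition([0, 1, 2], "R1")  # R1 occurs with a sample of s = 3
p_r1 = population.probability("R1")
p_r2 = population.probability("R2")
print(p_r1, p_r2)
```

Storing a single class label per element is what enforces implication (b): the sets of elements conditioned to different classes cannot overlap.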

6. Motivation. Experimental operations which in the usual terminology
are said to produce motives (e.g., food-deprivation) may affect either
the composition of an S or the magnitude of the s/S ratio. Detailed
discussion of these relations is beyond the scope of the present paper.
In all derivations presented here we shall assume motivating conditions
constant throughout an experiment.

7. Reinforcement. This term will be applied to any experimental
condition which ensures that successive occurrences of a given R will
each be contiguous with a new random sample of elements from some
specified S-population. Various ways of realizing this definition
experimentally will be discussed in the following sections.
SIMPLE CONDITIONING: REINFORCEMENT
BY CONTROLLED ELICITATION


Let us consider first the simplest type of conditioning experiment. The
system to be described consists of a subpopulation of stimulus
elements, S_c, which may be manipulated independently of the remainder
of the situation, S, and a class, R, of behaviors defined by certain
measurable properties. By means of a controlled original stimulus, that
is, one which has initially a high probability of evoking R, it is
ensured that an instance of R will occur on every trial contiguously
with the sample of stimulus elements which is present. In the familiar
buzz-shock conditioning experiment, for example, S_c would represent
the population of stimulus elements emanating from the sound source and
R would include all movements of a limb meeting certain specifications
of direction and amplitude; typically, the R to be conditioned is a
flexion response which may be evoked on each training trial by
administration of an electric shock.

Designating the mean number of elements from S_c effective on any one
trial as s_c, and the number of elements from S_c which are conditioned
to R at any time as x, the expected number of new elements conditioned
on any trial will be

$$\Delta x = s_c \frac{(S_c - x)}{S_c}. \tag{1}$$

If the change in x per trial is relatively small, and the process is
assumed continuous, the right hand portion of (1) may be taken as the
average rate of change of x with respect to number of trials, T, at any
moment, giving

$$\frac{dx}{dT} = s_c \frac{(S_c - x)}{S_c}. \tag{2}$$

This differential equation may be integrated to yield

$$x = S_c - (S_c - x_0) e^{-qT}, \tag{3}$$

where x_0 is the initial value of x, and q represents the ratio
s_c/S_c. Thus x will increase from its initial value to approach the
limiting value, S_c, in a negatively accelerated curve. A method of
evaluating x in these equations from empirical measures of response
latency, or reaction time, will be developed in a later section.
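
How closely the continuous solution tracks the trial-by-trial increments can be checked numerically. The sketch below is an added illustration with arbitrary parameter values and hypothetical function names; it iterates the per-trial increment q(S_c - x) and compares the result with the exponential form x = S_c - (S_c - x_0)e^{-qT}.

```python
import math

def conditioned_elements_closed_form(trials, s_total, x0, q):
    """Closed form: x = S_c - (S_c - x0) * exp(-q * T)."""
    return s_total - (s_total - x0) * math.exp(-q * trials)

def conditioned_elements_stepwise(trials, s_total, x0, q):
    """Apply the per-trial increment q * (S_c - x) once for each trial."""
    x = x0
    for _ in range(trials):
        x += q * (s_total - x)
    return x

S_C, X0, Q = 100.0, 10.0, 0.05  # illustrative values only
closed = conditioned_elements_closed_form(20, S_C, X0, Q)
stepwise = conditioned_elements_stepwise(20, S_C, X0, Q)
print(round(closed, 2), round(stepwise, 2))
```

For a small sampling ratio q the two agree to within about one element after 20 trials; the stepwise value runs slightly ahead because (1 - q)^T is smaller than e^{-qT}.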

If the remainder of the situation has been experimentally neutralized,
the probability of R in the presence of a sample from S_c will be given
by the ratio x/S_c. Representing this ratio by the single letter p
(with p_0 = x_0/S_c its initial value), and making appropriate
substitutions in (3), we have the following expression for probability
of R as a function of the number of reinforced trials.

$$p = 1 - (1 - p_0) e^{-qT}. \tag{3'}$$


Since we have not assumed any spe- 
cial properties for the original (or un- 
conditioned) stimulus other than that 
of regularly evoking the response to be 
conditioned, it is to be expected that 
the equations developed in this section 
will describe the accumulation of con- 
ditional relations in other situations 
than classical conditioning, provided 
that other experimental operations func- 
tion to ensure that the response to be 
learned will occur in the presence of
every sample drawn from the S-population.


OPERANT CONDITIONING: REINFORCEMENT
BY CONTINGENT STIMULATION


In the more common type of experimental arrangement, variously termed
operant, instrumental, trial and error, etc. by different
investigators, the response to be learned is not elicited by a
controlled original stimulus, but has some initial strength in the
experimental situation and occurs originally as part of so-called
“random activity.” Here the response cannot be evoked concurrently with
the presentation of each new stimulus sample, but some of the same
effects can be secured by making changes in the stimulating situation
contingent upon occurrences of the response. Let us consider a
situation of this sort, assuming that the activities of the organism
have been catalogued and classified into two categories, all movement
sequences characterized by a certain set of properties being assigned
to class R and all others to the class R_e, and that members of class R
are to be learned.

If changes in the stimulus sample are independent of the organism’s
behavior, we should expect instances of the two response classes to
occur, on the average, at rates proportional to their initial
probabilities. For if x elements from the S-population are originally
conditioned to R, then the probability of R will be x/S; the number of
new elements conditioned to R if an instance occurs will be
s[(S - x)/S], s again representing the number of stimulus elements in a
sample; and the mathematically expected increase in x will be the
product of these quantities, sx[(S - x)/S^2]. At the same time, the
probability of R_e will be (S - x)/S, and the number of new elements
conditioned to R_e if an instance occurs will be sx/S; multiplying
these quantities, we have sx[(S - x)/S^2] as the mathematically
expected decrease in x. Thus we should predict no average change in x
under these conditions.
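
The cancellation in this argument can be verified with exact arithmetic. The following sketch is an added illustration with hypothetical names and arbitrary values of S and s; it forms the expected gain and expected loss in x exactly as described above and takes their difference.

```python
from fractions import Fraction

def expected_change_in_x(x, S, s):
    """Expected change in x when sampling is independent of behavior."""
    p_r = Fraction(x, S)                   # probability of R
    p_other = Fraction(S - x, S)           # probability of the other class
    gain_if_r = Fraction(s * (S - x), S)   # new elements conditioned to R
    loss_if_other = Fraction(s * x, S)     # elements taken over by the other class
    return p_r * gain_if_r - p_other * loss_if_other

# Arbitrary illustrative population: S = 50 elements, samples of s = 5.
changes = [expected_change_in_x(x, S=50, s=5) for x in range(0, 51, 10)]
print(changes)
```

Every entry is exactly zero, whatever the current value of x, which is the "no average change" result stated in the text.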

In the acquisition phase of a learn- 
ing experiment two important restric- 
tions imposed by the experimenter tend 
to force a correlation between changes 
in the stimulus sample and occurrences 
of R. The organism is usually intro- 
duced into the experimental situation 
at the beginning of a trial, and the
trial lasts until the pre-designated
response, R, occurs. For example, in a
common discrimination apparatus the 
animal is placed on a jumping stand at 
the beginning of each trial and the trial 
continues until the animal leaves the 
stand; a trial in a runway experiment 
lasts until the animal reaches the end 
box, and so on. Typically the stimulat- 
ing situation present at the beginning 
of a trial is radically changed, if not 
completely terminated, by the occur- 
rence of the response in question; and a 
new trial begins under the same condi- 
tions, except for sampling variations, 
after some pre-designated interval. The 
pattern of movement-produced stimuli 
present during a trial may be changed 
after occurrences of R by the evocation 
of some uniform bit of behavior such as 
eating or drinking; in some cases the 
behavior utilized for this purpose must 
be established by special training prior to a learning experiment. In
the Skinner box, for example, the animal is trained to respond to the
sound of the magazine by approaching it and eating or drinking. Then
when operation of the magazine follows the occurrence of a bar-pressing
response during conditioning of the latter, the animal’s response to
the magazine will remove it from the stimuli in the vicinity of the bar
and ensure that for an interval of time thereafter the animal will not
be exposed to most of the S-population; therefore the sample of
elements to which the animal will next respond may be considered very
nearly a new random sample from S.

In the simplest operant conditioning experiments it may be possible to
change almost the entire stimulus sample after each occurrence of R
(complete reinforcement), while in other cases the sampling of only
some restricted portion of the S-population is correlated with R
(partial reinforcement). We shall consider the former case in some
detail in the remainder of this section.

By our definition of the conditional relation, we shall expect all
R-classes from which instances actually occur on any trial to be
conditioned to stimulus elements present on that trial. The first
movement to occur will be conditioned to the environmental cues present
at the beginning of the trial; the next movement will be conditioned to
some external cues, if the situation is not completely constant during
a trial, and to proprioceptive cues from the first movement, and so on,
until the predesignated response, R, occurs and terminates the trial.
If complete constancy of the stimulating situation could be maintained,
the most probable course of events on the next trial would be the
recurrence of the same sequence of movements. In practice, however, the
sample of effective stimulus elements will change somewhat in
composition, and some responses which occur on one trial may fail to
occur on the next. The only response which may never be omitted is R,
since the trial continues until R occurs. This argument has been
developed in greater detail by Guthrie (4). In order to verify the line
of reasoning involved, we need now to set these ideas down in
mathematical form and investigate the possibility of deriving functions
which will describe empirical curves of learning.

Since each trial lasts until R occurs, we need an expression for the
probable duration of a trial in terms of the strength of R. Suppose
that we have categorized all movement sequences which are to be counted
as “responses” in a given situation, and that the minimum time needed
for completion of a response-occurrence is, on the average, h. For
convenience in the following development, we shall assume that the mean
duration of instances of class R is approximately equal to that of
class R_e. Let the total number of stimulus elements available in the
experimental situation be represented by S, the sample effective on any
one trial by s, and the ratio s/S by q. The probability, p, of class R
at the beginning of any trial will have the value x/S; if this value
varies little within a trial, we can readily compute the probable
number of responses (of all classes) that will occur before the trial
is terminated. The probability that an instance of R will be the first
response to occur on the trial in question is p; the probability that
it will be the second is p(1 - p); the probability that it will be the
third is p(1 - p)^2; etc. If we imagine an indefinitely large number of
trials run under identical conditions, and represent the number of
response occurrences on any trial by n, we may weight each possible
value of n by its probability (i.e., expected relative frequency) and
obtain a mean expected value of n. In symbolic notation we have

$$\bar{n} = \sum_{n=1}^{\infty} n\,p\,(1 - p)^{n-1} = p \sum_{n=1}^{\infty} n\,(1 - p)^{n-1}.$$


The expression inside the summation sign will be recognized as the
general term of a well-known infinite series with the sum
1/(1 - (1 - p))^2. Then we have, by substitution,

$$\bar{n} = \frac{p}{(1 - (1 - p))^2} = \frac{p}{p^2} = \frac{1}{p}.$$


Then L, the average time per trial, will be the product of the expected
number of responses and the mean time per response.

$$L = \bar{n}h = \frac{h}{p} = \frac{Sh}{x}.$$
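
The series result can be confirmed numerically. The sketch below is an added illustration with arbitrary values of p and h and hypothetical function names; it sums the weighted terms n p (1 - p)^(n-1) directly and compares the total with the closed form 1/p before converting to time per trial.

```python
def expected_responses(p, max_terms=10_000):
    """Partial sum of n * p * (1 - p)**(n - 1) for n = 1 .. max_terms."""
    return sum(n * p * (1.0 - p) ** (n - 1) for n in range(1, max_terms + 1))

def mean_trial_duration(p, h):
    """Average time per trial, L = h / p, using the closed form for n-bar."""
    return h / p

p, h = 0.2, 2.5  # illustrative values only
n_bar = expected_responses(p)
duration = mean_trial_duration(p, h)
print(n_bar, duration)
```

The partial sum converges to 1/p because the terms fall off geometrically, so a response class with probability .2 is expected to appear on the fifth response on the average.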


Since R will be conditioned to all new
stimulus elements present on each trial,
we may substitute for x its equivalent
from equation (3), dropping the sub-
scripts from $S_c$ and $s_c$, and obtaining


$$ L = \frac{Sh}{S - (S - x_0)e^{-qT}} = \frac{h}{1 - \dfrac{L_0 - h}{L_0}\,e^{-qT}}. \tag{4} $$


Thus, L will decline from an initial
value of $L_0$ (equal to $Sh/x_0$) and ap-
proach the asymptotic minimum value
h over a series of trials.
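
The two steps of this derivation can be checked numerically. The Python sketch below sums the series for the mean number of responses per trial and evaluates equation (4). The function names and the parameter values (L0 = 60, h = 2.5, q = 0.1) are illustrative assumptions chosen for the check, not estimates taken from any data set.

```python
import math


def expected_responses(p, n_terms=10_000):
    """Mean number of responses per trial: sum over n of n * p * (1 - p)**(n - 1)."""
    return sum(n * p * (1 - p) ** (n - 1) for n in range(1, n_terms + 1))


def latency(T, L0, h, q):
    """Equation (4): mean time per trial after T reinforced trials."""
    return h / (1 - ((L0 - h) / L0) * math.exp(-q * T))


if __name__ == "__main__":
    # The truncated series should agree with the closed form 1/p.
    for p in (0.1, 0.5, 0.9):
        print(p, expected_responses(p), 1 / p)
    # Hypothetical parameters: the curve starts at L0 and falls toward h.
    for T in (0, 5, 20, 100):
        print(T, round(latency(T, L0=60.0, h=2.5, q=0.1), 3))
```

With these assumed values the latency starts at L0 when T = 0 and approaches h as T grows, which is the behavior described for equation (4).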

A preliminary test of the validity of 
this development may be obtained by 
applying equation (4) to learning data 
from a runway experiment in which the 
conditions assumed in the derivation are 
realized to a fair degree of approxima- 
tion. In Fig. 1 we have plotted acquisi- 
tion data reported by Graham and 
Gagné (3). Each empirical point rep- 
resents the geometric mean latency for 
a group of 21 rats which were rein- 
forced with food for traversing a simple 
elevated runway. The theoretical curve
in the figure represents the equation


$$ L = \frac{2.5}{1 - [\ldots]648\,e^{-[\ldots]T}}, $$


where values of $L_0$, h, and q have been
estimated from the data. This curve
appears to give a satisfactory gradua-
tion of the obtained points and, it might
be noted, is very similar in form to the
theoretical acquisition curve developed
by Graham and Gagné. The present
formulation differs from theirs chiefly
in including the time of the first re-
sponse as an integral part of the learn-
ing process. The quantitative descrip-
tion of extinction in this situation will
be presented in a forthcoming paper.

FIG. 1. [...] during conditioning,
obtained from published data of Graham
and Gagné (3), are fitted by a theoretical
curve derived in the text.

In order to apply the present theory 
to experimental situations such as the 
Skinner box, in which the learning pe- 
riod is not divided into discrete trials, 
we shall have to assume that the in- 
tervals between reinforcements in those 
situations may be treated as “trials” 
for analytical purposes. Making this 
assumption, we may derive an expres- 
sion for rate of change of conditioned 
response strength as a function of time 
in the experimental situation, during a 
period in which all responses of class R 
are reinforced. 

L, as defined above, will represent the 
time between any two occurrences of R. 
Then if we let t represent time elapsed 
from the beginning of the learning pe- 
riod to a given occurrence of R, and T 
the number of occurrences (and there- 
fore reinforcements) of R, we have from 
the preceding development 


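$$ L = \frac{Sh}{x}. $$


Since L may be considered as the in-
crement in time during a trial, we can
write the identity


$$ \frac{\Delta T}{\Delta t} = \frac{1}{L} = \frac{x}{Sh}. $$


Substituting for $\Delta x/\Delta T$ its equivalent
from (1), without subscripts, and for
$\Delta T/\Delta t$ its equivalent from the preced-
ing equation, we have


$$ \frac{\Delta x}{\Delta t} = \frac{\Delta x}{\Delta T}\cdot\frac{\Delta T}{\Delta t} = \frac{s(S - x)}{S}\cdot\frac{x}{Sh} = \frac{s(S - x)x}{S^2 h}. \tag{5} $$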


If the change in x per reinforcement is 
small and the process is assumed con- 
tinuous, the right hand portion of equa- 
tion (5) may be taken as the value of 
the derivative dx/dt and integrated 
with respect to time— 


$$ x = \frac{S}{1 + \dfrac{S - x_0}{x_0}\,e^{-Bt}}, \tag{6} $$


where B = s/Sh. In general, this equa-
tion defines a logistic curve with the
amount of initial acceleration depend-
ing upon the value of $x_0$. Curves of
probability (x/S) vs. time for S = 100,
B = 0.25, and several different values
of $x_0$ are illustrated in Fig. 2.

Since we are considering a situation
in which a reinforcement is administered
(or a new “trial” is begun) after each
occurrence of R, we are now in a posi-
tion to express the expected rate of
occurrence of R as a function of time.
Representing rate of occurrence of R
by r = dR/dt, and the ratio 1/h by w,
we have


$$ r = \frac{dR}{dt} = \frac{dT}{dt} = \frac{wx}{S} = \frac{w}{1 + \dfrac{S - x_0}{x_0}\,e^{-Bt}}, $$


and if we take the rate of R at the be-
ginning of the experimental period as
$r_0 = wx_0/S$ this relation becomes


$$ r = \frac{w}{1 + \dfrac{w - r_0}{r_0}\,e^{-Bt}}. \tag{7} $$
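
As a rough check on the passage from equation (5) to equations (6) and (7), the following Python sketch compares the logistic solution with a simple Euler integration of dx/dt = (B/S)x(S - x), which is equation (5) rewritten with B = s/Sh. The function names, the step size, and the initial values of x are assumptions made for illustration; only S = 100 and B = 0.25 come from the text.

```python
import math


def x_logistic(t, S, x0, B):
    """Equation (6): number of conditioned elements at time t."""
    return S / (1 + ((S - x0) / x0) * math.exp(-B * t))


def x_euler(t_end, S, x0, B, dt=1e-3):
    """Euler integration of dx/dt = (B / S) * x * (S - x)."""
    x = x0
    for _ in range(round(t_end / dt)):
        x += dt * (B / S) * x * (S - x)
    return x


def rate(t, w, r0, B):
    """Equation (7): expected rate of responding at time t."""
    return w / (1 + ((w - r0) / r0) * math.exp(-B * t))


if __name__ == "__main__":
    S, B = 100, 0.25          # values quoted in the text for Fig. 2
    for x0 in (1, 10, 50):    # hypothetical initial values
        curve = [round(x_logistic(t, S, x0, B) / S, 3) for t in (0, 5, 10, 20)]
        print(x0, curve)
```

Smaller initial values of x give a longer period of positive acceleration before the curve bends over, in line with the remark that the amount of initial acceleration depends on the starting value.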


To illustrate this function, we have
plotted in Fig. 3 measures of rate of
responding during conditioning of a bar-
pressing response by a single rat. The
apparatus was a Skinner box; motiva-
tion was 24 hours thirst; the animal had
previously been trained to drink out of
the magazine, and during the period il-
lustrated was reinforced with water for
all bar-pressing responses. Measures
of rate at various times were obtained
by counting the number of responses
made during the half-minute before
and the half-minute after the point
being considered, and taking that value
as an estimate of the rate in terms of
responses per minute at the midpoint.
The theoretical curve in the figure rep-
resents the equation


$$ r = \frac{13}{1 + 25e^{-.24t}}. $$


FIG. 2. Illustrative curves of probability
vs. time during conditioning; parameters
of the curves are the same except for the
initial x-values. (Abscissa: time.)

FIG. 3. Number of responses per minute
during conditioning of a bar-pressing habit
in a single rat; the theoretical curve is
derived in the text. (Abscissa: time in
minutes.)

A considerable part of the variability
of the empirical points in the figure is
due to the inaccuracy of the method
of estimating rates. In order to avoid
this loss of precision, the writer has
adopted the practice of using cumula-
tive curves of responses vs. time for
most purposes, and fitting the cumula-
tive records with the integral of equa-
tion (7):


$$ R = wt + \frac{w}{B}\log\left[\frac{r_0}{w} + \left(1 - \frac{r_0}{w}\right)e^{-Bt}\right], \tag{8} $$


where R represents the number of re-
sponses made after any interval of time,
t, from the beginning of the learning
period. The original record of re-
sponses vs. time, from which the data
of Fig. 3 were obtained, is reproduced
in Fig. 4. Integration of the rate equa-
tion for this animal yields


$$ R = 13t + 125\log(.038 + .962e^{-.24t}). $$


Magnitudes of R computed from this
equation for several values of t have
been plotted in Fig. 4 to indicate the
goodness of fit; the theoretical curve has
not been drawn in the figure since it
would completely obscure most of the
empirical record. In an experimental
report now in press (2), equation (8)
is fitted to several mean conditioning
curves for groups of four rats; in all
cases, the theoretical curve accounts for
more than 99 per cent of the variance
of the observed R values. Further
verification of the present formulation
has been derived from that study by
comparing the acquisition curves of suc-
cessively learned bar-pressing habits,
obtained in a Skinner-type condition-
ing apparatus which included two bars
differing only in position. It has been
found that the parameters w and s/S
can be evaluated from the conditioning
curve of one bar response, and then
used to predict the detailed course of
conditioning of a second learned re-
sponse.

FIG. 4. Reproduction of the original cu-
mulative record from which the points of Fig.
3 were obtained. Solid circles are computed
from an equation given in the text.
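
The relation between the rate equation and the cumulative record can be illustrated with a short Python sketch. It takes the constants of the fitted rate curve for the single rat (w = 13, a coefficient of 25, and an exponent constant of .24), integrates the rate numerically, and compares the result with the closed form of equation (8) written with a natural logarithm. The function names and the trapezoidal check are assumptions added here. One further observation is offered as an inference rather than as a statement from the text: (w/B) ln 10 is about 124.7, which agrees with the coefficient 125 in the fitted cumulative equation if the logarithm there is taken to base 10.

```python
import math

W, A, B = 13.0, 25.0, 0.24   # constants of the fitted rate curve r = W / (1 + A * exp(-B * t))


def rate(t):
    """Fitted rate of responding, in responses per minute."""
    return W / (1 + A * math.exp(-B * t))


def cumulative(t):
    """Closed-form integral of rate from 0 to t (equation (8), natural logarithm)."""
    r0_over_w = 1 / (1 + A)
    return W * t + (W / B) * math.log(r0_over_w + (1 - r0_over_w) * math.exp(-B * t))


def cumulative_numeric(t, steps=20_000):
    """Trapezoidal integration of rate from 0 to t, used only as a check."""
    dt = t / steps
    total = 0.5 * (rate(0.0) + rate(t))
    total += sum(rate(i * dt) for i in range(1, steps))
    return total * dt


if __name__ == "__main__":
    for t in (5, 10, 20):
        print(t, round(cumulative(t), 2), round(cumulative_numeric(t), 2))
    print("base-10 coefficient:", round((W / B) * math.log(10), 1))
```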

The overall accuracy of these equa-
tions in describing the rate of condi-
tioning of bar-pressing and runway re-
sponses should not be allowed to obscure
the fact that a small but systematic
error is present in the initial portion
of most of the curves. It is believed
that these disparities are due to the fact
that experimental conditions do not
usually fully realize the assumption that
only one R-class receives any reinforce-
ment during the learning period. A
more general formulation of the theory,
which does not require this assumption,
will be discussed in the next section.


PARTIAL REINFORCEMENT 


It can be shown that a given response
may be “learned” in a trial and error
situation provided that some sub-popu-
lation of stimulus elements is so con-
trolled by experimental conditions that
each sample of elements drawn from
it is contiguous with an occurrence of
the response. The sort of derivation
needed to handle this kind of partial
reinforcement will be sketched briefly
in this section. A more detailed treat-
ment will be given, together with rele-
vant experimental evidence, in a paper
now in preparation. It should be em-
phasized that we are using the term
“partial” to refer to incomplete change
of the stimulus sample on each occur-
rence of a given response, and not to
periodic, or intermittent reinforcement.

Consider a behavior system involv-
ing two classes of competing behaviors,
R and $R_e$, which may occur in a situa-
tion, S, composed of two independently
manipulable sub-populations, $S_r$ and $S_e$.
Experimental conditions are to ensure
that of the sample, s, of elements stimu-
lating the organism at any time, ele-
ments from $S_r$ remain effective until
terminated by the occurrence of R,
while elements from $S_e$ remain effective
until terminated by the occurrence of
$R_e$. This kind of system might be il-
lustrated by a Skinner box in which the
entire stimulus sample is not terminated
by occurrence of the bar-pressing re-
sponse; for example, if the box is il-
luminated, the visual stimulation will
be relatively unaffected by bar-pressing
but will be terminated if the animal
closes its eyes (the latter behavior
being, then, an instance of $R_e$).

Let x represent the total number of
elements from S conditioned to R at a
given time, $x_r$ the number of elements
from $S_r$ conditioned to R, $T_r$ and $T_e$
the numbers of occurrences of R and
$R_e$ prior to the time in question, and q
the ratio s/S. By reasoning similar to
that utilized in deriving equations (2)
and (5), we may obtain for the aver-
age rate of change of $x_r$ with respect to
$T_r$ at any time


$$ \frac{dx_r}{dT_r} = \frac{sS_r(S_r - x_r)}{SS_r} = q(S_r - x_r). \tag{9} $$


This may be integrated to yield


$$ x_r = S_r - (S_r - x_{r0})e^{-qT_r}, \tag{10} $$


which is identical in form with equa-
tion (2).

The other component of x, $(x - x_r)$,
will decrease as these elements become
conditioned to the competing response
class, $R_e$, according to the following
relations:


$$ \frac{d(x - x_r)}{dT_e} = -\frac{s(S - S_r)(x - x_r)}{S(S - S_r)} = -q(x - x_r), \tag{11} $$

and the integral,

$$ x - x_r = (x_0 - x_{r0})e^{-qT_e}. \tag{12} $$


It will be observed that an analogous 
set of equations could be written for 
changes in the number of elements con-
ditioned to $R_e$, and that the argument
could be extended to any number of 
mutually exclusive classes of responses. 
From these relations it is not difficult 
to deduce differential equations which 
may be at least numerically integrated 
to yield curves giving probability of 
occurrence of each response class as a 
function of number of reinforcements. 
We shall not carry out the derivations 
here, but shall point out a number of 
properties of the curves obtained which 
will be evident from inspection of equa-
tions (10) and (12).

1. Regardless of the initial probabili-
ties, the behavior system will tend to
a state of equilibrium in which the final
mean probability of R will be $S_r/S$
and the final mean probability of $R_e$ will
be $(S - S_r)/S$.

2. If the number of elements from S
conditioned to R at the start of an
experiment is greater than $S_r$, the prob-
ability of R will decrease until the
equilibrium is reached. (Of course all
statements made here about R have
analogues for $R_e$.)

3. If the number of elements from
S conditioned to R at the start of an
experiment is less than $S_r$, the prob-
ability of R will increase until the equi-
librium value is reached.

4. If all elements originally condi-
tioned to R belong to the sub-population
$S_r$, then the curve relating probability
to number of reinforcements will be
identical with equation (3') except for
the asymptote, which will be $S_r/S$
rather than unity.

5. If some of the elements originally
conditioned to R do not belong to $S_r$,
but $x_0$ is less than $S_r$, then the curve
relating probability to number of rein-
forcements will rise less steeply at first
than equation (3), and may even have
an initial positively accelerated limb.
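
The properties listed above can be illustrated by a crude numerical integration of the kind alluded to in the text. The Python sketch below is not the derivation the paper omits; it is an expected-value iteration built on two assumptions made here: on each response occurrence, R occurs with probability x/S and then changes the count for the first sub-population by the increment of equation (9), while the competing response occurs with probability 1 - x/S and then changes the remaining component by the decrement of equation (11). The function name and all parameter values are hypothetical.

```python
def simulate(S, S_r, x_r, y, q, n_responses):
    """Expected-value iteration over successive response occurrences.

    x_r : elements of the sub-population S_r conditioned to R
    y   : elements outside S_r conditioned to R (x - x_r in the text)
    Returns the probability of R before each response, plus the final value.
    """
    probs = []
    for _ in range(n_responses):
        p = (x_r + y) / S
        probs.append(p)
        # R occurs with probability p; equation (9) gives the gain in x_r.
        x_r += p * q * (S_r - x_r)
        # R_e occurs with probability 1 - p; equation (11) gives the loss in y.
        y -= (1 - p) * q * y
    probs.append((x_r + y) / S)
    return probs


if __name__ == "__main__":
    # Hypothetical values: S = 100 elements, 60 of them in S_r, q = 0.1.
    rising = simulate(100, 60, x_r=10.0, y=5.0, q=0.1, n_responses=2000)
    falling = simulate(100, 60, x_r=60.0, y=30.0, q=0.1, n_responses=2000)
    print(round(rising[0], 3), round(rising[-1], 3))
    print(round(falling[0], 3), round(falling[-1], 3))
```

Under these assumptions both runs settle at 60/100, the ratio of the first sub-population to the whole, whether the starting probability lies below or above that value.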

It will be noted that from the present 
point of view, conditioning and extinc- 
tion are regarded simply as two aspects 
of a single process. In practice we 
categorize a given experiment as a 
study of conditioning or a study of 
extinction depending upon which be- 
haviors are being recorded. It seems 
quite possible that both conditioning 
and extinction always occur concur-
rently in any behavior system, and that
the common practice of regarding them 
as separate processes is based more on 
tradition and the limitations of record- 
ing apparatus than upon rational con- 
siderations. In the present formula- 
tion, reinforcement is treated as a quan- 
titatively graded variable with “pure 
extinction” at one end of a continuum. 
Any portion of an S-population may be
related to an R-class by experimental
conditions which produce a correlation
between stimulus sampling and response
occurrences. Under given conditions
of reinforcement an R-class may in-
crease or decrease in probability of oc-
currence over a series of trials depend-
ing upon whether the momentary prob-
ability is less than or greater than the
equilibrium value for those conditions.


DISCUSSION


The foregoing sections will suffice to
illustrate the manner in which problems
of learning may be handled within the
framework of a statistical theory. The
[...] assumed in the derivations of the pres-
ent paper has been completed, and a
report is now in press. Other papers
[...] this formulation [...] spontaneous
recovery, [...] phenomena.

[...] program to con[...] learning requires
little comment. No attempt has been
made to present a “new” theory. It
[...] investigation to [...] A thorough
study of those [...] influenced the writer’s
[...] in many respects. Rather
than build directly on ei[...]
theory seems to be an inevitable de-
velopment at the present stage of the
science of behavior; agreement on this
point may be found among writers of
otherwise widely diverse viewpoints,
e.g., Brunswik (1), Hoagland (5),
Skinner (7), and Wiener (9). It is to
be expected that with increasing rigor
of definition and continued interplay
between theory and experiment, the
various formulations of learning will
tend to converge upon a common set of
concepts.

It may be helpful to outline briefly 
the point of view on certain contro-
versial issues implied by the present
analysis. 

Stimulus-response terminology. An
attempt has been made to overcome
some of the rigidity and oversimplifica-
tion of traditional stimulus-response
theory without abandoning its principal
advantages. We have adopted a defini-
tion of stimulus and response similar to
Skinner’s (7) concept of generic classes,
and have given it a statistical interpre-
tation. Laws of learning developed
within this framework refer to behavior
systems (as defined in the introductory
section of this paper) rather than to
relations between isolated stimulus-
response correlations.

The learning curve. This investi-
gation is not intended to be another
search for “the learning function.”
The writer does not believe that any
simple function will be found to ac-
count for learning independently of
particular experimental conditions. On
the other hand, it does seem quite pos-
sible that from a relatively small set of
definitions and assumptions we may be
able to derive expressions describing
learning under various specific experi-
mental arrangements.

Measures of behavior. Likelihood of
responding has been taken as the pri-
mary dependent variable. Analyses pre-
sented above indicate that simple rela-
tions can be derived between proba-
bility and such common experimentally
obtained measures as rate of responding
and latency.

Laws of contiguity and effect. Avail- 
able experimental evidence on simple 
learning has seemed to the writer to 
require the assumption that temporal 
contiguity of stimuli and behavior is a
necessary condition for the formation 
of conditional relations. At the level 
of differential analysis, that is of laws 
relating momentary changes in behav- 
ior to changes in independent variables, 
no other assumption has proved neces- 
sary at the present stage of the investi- 
gation. In order to account for the 
accumulation of conditional relations in 
favor of one R-class at the expense of 
others in any situation, we have ap- 
pealed to a group of experimental op- 
erations which are usually subsumed 
under the term “reinforcement” in cur- 
rent experimental literature. Both 
Guthrie’s (4) verbal analyses and the
writer’s mathematical investigations in-
dicate that an essential property of re-
inforcement is that it ensures that suc-
cessive occurrences of a given R will be
contiguous with different samples from
the available population of stimuli. We
have made no assumptions concerning
the role of special properties of certain
after-effects of responses, such as drive-
reduction, changes in affective tone, etc.
Thus the quantitative relations devel-
oped here may prove useful to investi-
gators of learning phenomena regardless
of the investigators’ beliefs as to the
nature of underlying processes.


SUMMARY 

An attempt has been made to clarify 
some issues in current learning theory 
by giving a statistical interpretation to 
the concepts of stimulus and response 
and by deriving quantitative laws that 
govern simple behavior systems. De-
pendent variables, in this formulation,
are classes of behavior samples with
common quantitative properties; inde-
pendent variables are statistical dis-
tributions of environmental events.
Laws of the theory state probability 
relations between momentary changes in 
behavioral and environmental variables. 

From this point of view it has been 
possible to derive simple relations be- 
tween probability of response and sev- 
eral commonly used measures of learn- 
ing, and to develop mathematical ex- 
pressions describing learning in both 
classical conditioning and instrumental 
learning situations under simplified con- 
ditions. 

No effort has been made to defend 
the assumptions underlying this formu- 
lation by verbal analyses of what 
“really” happens inside the organism 
or similar arguments. It is proposed 
that the theory be evaluated solely by 
its fruitfulness in generating quantita- 
tive functions relating various phenom-
ena of learning and discrimination.


STATISTICAL THEORY OF SPONTANEOUS RECOVERY
AND REGRESSION


W. K. ESTES


Indiana University


From the viewpoint of one interested
in constructing a learning theory, it
would be convenient if an organism’s
habits of responding with respect to
any given situation were modifiable
only during periods of exposure to the
situation. In that case, it would not
be unreasonable, prima facie, to hope
that all of the empirical laws of learn-
ing could be stated in terms of rela-
tions between behavioral and environ-
mental variables. Nothing in psychol-
ogy is much more certain, however,
than that orderly changes in response
tendencies (e.g., spontaneous recovery,
forgetting) do occur during intervals
when the organism and the situation
are well separated.

How are these “spontaneous” changes
to be accounted for? It is easy enough
to construct a law expressing some be-
havioral measure as a function of time,
but an unfilled temporal interval never
remains permanently satisfying as an
explanatory variable. The temporal
gap has to be filled with events of some
sort, observed or inferred, in the envi-
ronment or in the organism. The fa-
vorite candidate for the intervening po-
sition has usually been a postulated
state or process, either neural or purely
hypothetical, which varies spontane-
ously during rest intervals in whatever
manner is required to account for the
behavioral changes. The difficulty with
this type of construct is that it is al-
ways much easier to postulate than to
unpostulate. Few hypothetical entities
are so ill-favored that once having se-
cured a foothold they cannot face out
each new turn of empirical events with
the aid of a few ad hoc assumptions.

The approach to time-dependent learn-
ing phenomena which will be illustrated
in this paper attempts to shift the bur-
den of explanation from hypothesized
processes in the organism to statistical
properties of environmental events. The
very extensiveness of the array of hy-
pothetical constructs (e.g., set, reactive
inhibition, memory trace) which now
compete for attention in this area sug-
gests that postulates of this type have
entered the scene prematurely. Until
more parsimonious explanatory vari-
ables have been fully explored, it will
scarcely be possible either to define
clearly the class of problems which re-
quire explanation or to evaluate the
various special hypotheses that have
been proposed.

By “more parsimonious” sources of
explanation, I refer to the variables,
ordinarily stimulus variables, which are
intrinsic to a given type of behavioral
situation and thus must be expected to
play a role in any interpretive schema.


Reprinted with permission.

In the present instance we are inter-
ested specifically in the way learned
response tendencies change during rest 
intervals following experimental peri- 
ods. And we note that there are two 
principal ways in which stimulus vari- 
ables could lead to modification in re- 
sponse tendencies during rest intervals. 
The first is the direct effect that changes 
in the stimulus characteristics of ex- 
perimental situations from trial to trial 
or period to period may have upon re- 
sponse probability. The second is the 
learning that may occur between peri- 
ods if the stimulating situations obtain- 
ing within and between periods have 
elements in common. The former cate- 
gory can again be subdivided according 
as the environmental variation is sys-
tematic or random.

The random component has been se- 
lected as our first subject of investiga- 
tion for several reasons. One is that it 
has received little attention heretofore 
in learning theory. Another is that in 
other sciences apparently spontaneous 
changes in observables have frequently 
turned out to be attributable to random 
processes at a more molecular level. 
Perhaps not surprisingly, considerable 
analysis has been needed in order to 
ascertain how random environmental 
fluctuations during intervals of rest 
following learning periods would be ex- 
pected to influence response probabili- 
ties. It will require the remainder of 
this paper to summarize the methods 
and results of this one phase of the 
over-all investigation. 


GENERAL THEORY OF STIMULUS 
FLUCTUATION 


Even prior to a detailed analysis, we
can anticipate that whenever environ-
mental fluctuation occurs, the prob-
ability of a response at the end of one
experimental period will not be the
same as the probability at the begin-
ning of the next. If conditioning is
carried out during a given period, some
of the newly conditioned stimulus ele-
ments² will be replaced before the next
period by elements which have not
previously been available for condition-
ing. Similarly, during the interval fol-
lowing an extinction period, random
fluctuation will lead to the replacement
of some of the just extinguished stimu-
lus elements by others which were sam-
pled during conditioning but have not
been available during extinction. In
either case, the result will be a pro-
gressive change in response probability
as a function of duration of the rest
interval.

In order to make these ideas testable, 
we must state more formally and ex- 
plicitly the concepts and assumptions 
involved. Once this is done, we will 
have in effect a fragmentary theory, or 
model, which may account for certain 
apparently spontaneous changes in re- 
sponse tendencies. At a minimum, this 
formal model will enable us to derive 
the logical consequences of the concept 
of random environmental fluctuation so 
that they may be tested against experi- 
mental data. If the correspondence 
turns out to be good, we may wish to 
incorporate this model into the concep- 
tual structure of S-R learning theory, 
viewing it as a limited theory which ac- 
counts for a specific class of time-de- 
pendent phenomena. 

Most of the assumptions we shall re- 
quire have been discussed elsewhere 
(8) and need only be restated briefly 
for our present purposes. 

a. Any environmental situation, as
constituted at a given time, determines
for a given organism a population of
stimulus events from which a sample
affects the organism’s behavior at any
instant; in statistical learning theories
the population is conceptualized as a
set of stimulus elements from which a
random sample is drawn on each trial.

b. Conditioning and extinction occur
only with respect to the elements sam-
pled on a trial.

c. The behaviors available to an or-
ganism in a given situation may be
categorized into mutually exclusive and
exhaustive response classes.

d. At any time, each stimulus ele-
ment in the population is conditioned
to exactly one of these response classes.

² For reasons of mathematical simplicity
and convenience I shall develop these ideas
in terms of the concepts of statistical learn-
ing theory. It will be apparent, however,
that within the Hullian system a similar
argument could be worked out in terms of
the fluctuation of stimuli along generalization
continua.

On the basis of these assumptions, 
functions have been derived by various 
investigators (2, 5, 8, 16, 21) to de- 
scribe the course of learning predicted 
for an idealized situation in which the 
physical environment is perfectly con- 
stant and the organism samples the 
stimulus population on each trial. No 
idealized situations are available for 
testing purposes, but the theory seems 
to give good approximations to em- 
pirical learning functions obtained in 
short experimental periods under well- 
controlled conditions. 

In the present paper we turn our at-
tention from behavioral changes that
occur within experimental periods to
the changes that occur as a function of
the intervals between periods. Corre-
spondingly, we replace the simplifying
assumption of a perfectly constant
situation with the assumption of a ran-
domly fluctuating situation.³ Specifi-
cally, it will be assumed that the avail-
ability of stimulus elements during a
given learning period depends upon a
large number of independently variable
components or aspects of the environ-
mental situation, all of which undergo
constant random fluctuation.

³ It is possible now to go back and “cor-
rect” the functions derived earlier to allow
for this random variation, but we will not
be able to go into this point in the present
paper.

Now let us consider the type of ex- 
periment in which an organism is run 
for more than one period in the same 
apparatus. In dealing with the behav- 
ior that occurs during any given experi- 
mental period, the total population S+ 
of stimulus elements available in the 
situation at any time during the ex- 
periment can be partitioned into two 
portions: the subset S of elements 
which are available during that period 
and the subset S' of elements which
are not. Under the conditions consid-
ered in this paper, the probability of a
response at any given time during the
period is equal to the proportion of ele-
ments in the available set S that are
conditioned to that response. Owing to
environmental fluctuation, there is some
probability j that an element in the
available set S will become unavailable,
i.e., go into S', during any given in-
terval Δt, and a probability j' that an
element in S' will enter S. These ideas


are illustrated in Fig. 1 for a hypotheti- 
cal situation. 


FIG. 1. Fluctuations in stimulus sets dur-
ing spontaneous regression (upper panel) and
spontaneous recovery from extinction (lower
panel). Circles represent elements connected
to response A. Values of p represent prob-
abilities of response A in the available set S.
[Panel labels: Init. State (p = 1), End Ext.
(p = 0), Rest, Final State (p = .75).]

The relevance of the scheme for
learning phenomena arises from the
fact that both conditioned and uncon-
ditioned elements will constantly be
fluctuating in and out of the available
set S. During an experimental period
in which conditioning or extinction oc-
curs, the proportion of conditioned ele-
ments in S will increase or decrease
relative to the proportion in S'. But
during a subsequent rest interval, these
proportions will tend toward equality as
a result of the fluctuation process.
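
The drift toward equality can be sketched with a simple difference equation. The Python code below is an illustration under assumptions made here, not the equation derived later in the paper: time is divided into intervals Δt, the two sets keep constant sizes so that elements leaving S are replaced one for one by elements entering from S', and the exchange is treated in terms of expected proportions. The function name is hypothetical; the values j = .6 and j' = .2 are those of the hypothetical situation of Fig. 1.

```python
def rest_interval(p_s, p_sp, j, j_prime, steps):
    """Proportion of conditioned elements in S after each interval of rest.

    p_s  : proportion of conditioned elements in the available set S
    p_sp : proportion of conditioned elements in the unavailable set S'
    A fraction j of S is exchanged for elements of S', which make up a
    fraction j_prime of S' (set sizes are assumed to stay constant).
    """
    history = [p_s]
    for _ in range(steps):
        # Both proportions are updated from the values at the start of the interval.
        p_s, p_sp = (p_s + j * (p_sp - p_s), p_sp + j_prime * (p_s - p_sp))
        history.append(p_s)
    return history


if __name__ == "__main__":
    regression = rest_interval(1.0, 0.0, j=0.6, j_prime=0.2, steps=50)
    recovery = rest_interval(0.0, 1.0, j=0.6, j_prime=0.2, steps=50)
    print([round(p, 3) for p in regression[:5]], round(regression[-1], 3))
    print([round(p, 3) for p in recovery[:5]], round(recovery[-1], 3))
```

Under these assumptions the proportion in S falls from unity toward .25 after conditioning, and rises from zero toward .75 after extinction, with the two sets ending at a common density in each case.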


INTERPRETATION OF SPONTANEOUS 
RECOVERY AND REGRESSION⁴


The essentials of our treatment of
spontaneous recovery and regression
will be clear from an inspection of
Fig. 1. The upper panel illustrates a
case in which, starting from a zero
level, conditioning of a given response
A is carried out during one period until
the probability of A in the available
situation represented by the set S is
unity. At the end of the conditioning
period we will have, neglecting any
fluctuation that may have occurred
during the period, all of the elements
in S conditioned to A and all of the
temporarily unavailable elements in S'
unconditioned. During the first inter-
val Δt of the ensuing rest interval, the
proportion j = .6 of the conditioned
elements will escape from S, being re-
placed by the proportion j' = .2 of the
unconditioned elements from S'. Dur-
ing further intervals the interchange
will continue, at a progressively de-
creasing rate, until the system arrives
at the final state of statistical equilib-
rium in which the densities of condi-
tioned elements in S and S' are equal.
The predicted course of spontaneous
regression in terms of the proportion
of conditioned elements that will be in
S at any time following the condition-
ing period is given by the topmost
curve in the upper panel of Fig. 2.
The equation of the curve will be de-
rived in a later section.

⁴ The term spontaneous regression will be
used here to refer to any decrease in response
probability which is attributable solely to
stimulus fluctuation. [...] short time intervals,
the phenomenon of forgetting may be virtually
identified with regression, but that over longer
intervals forgetting is influenced to an
increasing [...] by effects of interpolated
learning.

In an analogous fashion the essentials
of the spontaneous recovery process are
schematized in the lower panel of Fig.
1. We begin at the left with a situa-
tion following maximal conditioning so
that all elements are conditioned to re-
sponse A. During a single period of
extinction, all elements in the available
set S are conditioned to the class of
competing responses Ā and the prob-
ability of A goes temporarily to zero.
Then during a recovery interval, the
random interchange of conditioned and
unconditioned elements between S and
S' results in a gradual increase in the
proportion of conditioned elements in
S until the final equilibrium state is
reached. The predicted course of spon-
taneous recovery as a function of time
is given by the topmost curve in the
lower panel of Fig. 3.

According to this analysis, spontaneous regression and recovery are to be
regarded as two aspects of the same process. In each case the form of the
process is given by a negatively accelerated curve with the relative rate
of change depending solely upon the characteristics of the physical
situation embodied in the parameters j and j'. Rates of regression and
recovery should, then, vary together whenever the variability of the
stimulating situation is modified.

Fig. 2. Families of spontaneous regression curves. In the upper panel the
proportion of conditioned elements in S' at the end of conditioning is
zero and the proportion in S is the parameter. In the lower panel the
proportion of conditioned elements in S at the end of conditioning is
unity and the proportion in S' is the parameter.

It cannot be assumed, however, that amounts of regression and recovery
should be equal and opposite in all experiments. The illustrative example
of Fig. 1 meets two special conditions that do not always hold: (a) the
conditioning and extinction series start from initial response
probabilities of zero and unity, respectively; and (b) conditioning and
extinction are carried to comparable criteria within the experimental
period preceding the rest interval.
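
The random interchange of elements between S and S' described above can be
imitated numerically. The following is only an illustrative sketch, not
part of the original analysis: the number of elements, the value j = .1,
the seed, and the function name are assumptions chosen for the example;
only j' = .2 is taken from the illustration in the text. All elements in S
start conditioned and all elements in S' start unconditioned, as in the
regression example.

```python
import random


def simulate_regression(n_total=20000, j=0.1, j_prime=0.2, steps=60, seed=0):
    """Monte Carlo sketch of spontaneous regression by stimulus fluctuation.

    j        assumed chance that an element in S escapes to S' per interval
    j_prime  chance that an element in S' enters S per interval
    Returns the proportion of conditioned elements in S, measured at the
    start and after each of `steps` intervals.
    """
    rng = random.Random(seed)
    big_j = j_prime / (j + j_prime)      # equilibrium share of elements in S
    n_available = round(big_j * n_total)
    # Elements 0..n_available-1 start in S and are conditioned;
    # the remaining elements start in S' and are unconditioned.
    in_s = [i < n_available for i in range(n_total)]
    conditioned = list(in_s)
    curve = []
    for _ in range(steps + 1):
        available = sum(in_s)
        cond_available = sum(1 for a, c in zip(in_s, conditioned) if a and c)
        curve.append(cond_available / available)
        # One interval of random interchange between S and S'.
        in_s = [
            (rng.random() >= j) if a else (rng.random() < j_prime)
            for a in in_s
        ]
    return curve


curve = simulate_regression()
```

Under these assumed values the simulated proportion starts at unity and
declines at a decreasing rate toward the common density of conditioned
elements in S and S'.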


PREDICTIONS CONCERNING EFFECTS 
OF EXPERIMENTAL VARIABLES 

Terminal level of conditioning or extinction. If other conditions remain
fixed, the level of response probability attained at the end of a single
learning period will determine both the initial value and the asymptote of
the curve of regression or recovery. For the situation represented by the
upper panel of Fig. 1, the curve of conditioning goes to unity, and the
predicted course of spontaneous regression is given by the top curve in
the upper panel of Fig. 2. If in the same situation, conditioning has been
carried only to a probability level of, say, .67, then the total number of
conditioned elements will be smaller and the curve of regression will not
only start at a lower value, but will run to a lower asymptote, and so on.
Similarly, if in the situation represented by the lower panel of Fig. 1,
response probability goes to zero during the extinction period, the
predicted course of spontaneous recovery is given by the lowest curve in
the upper panel of Fig. 3; if extinction terminates at higher probability
levels, we obtain the successively higher recovery curves shown in the
figure.

Number of preceding learning periods. Increasing the number of preceding
acquisition periods would tend to increase the total number of conditioned
elements in S* and therefore the asymptote of the curve of regression. If
level of response probability at the end of the last acquisition period is
fixed at some one value, say unity, then variation in the proportion of
conditioned elements in S' yields the family of regression curves
illustrated in the lower panel of Fig. 2, all curves starting at the same
point but diverging to different asymptotes. This curve family will be
recognized as corresponding to the well-known relationship between
retention and amount of overlearning, where overlearning is defined in
terms of additional training beyond the point at which response
probability in the temporarily available situation reaches unity.

Fig. 3. Families of spontaneous recovery curves. In the upper panel the
proportion of conditioned elements in S' at the end of extinction is unity
and the proportion of conditioned elements in S at the end of extinction
is the parameter. In the lower panel, the proportion of conditioned
elements in S' at the end of extinction is the parameter and the
proportion in S is zero.

Analogous considerations apply in the case of spontaneous recovery.
Increasing the number of preceding extinction periods would tend to
decrease the proportion of conditioned elements remaining in S' at the end
of extinction and thus the asymptote of the curve of spontaneous recovery,
as illustrated in the lower panel of Fig. 3. On the other hand, increasing
the number of conditioning periods prior to extinction would tend to
increase the density of conditioned elements in S' and thus the asymptote
of the curve of recovery following a period of extinction.

The experimental phenomenon of “extinction below zero” corresponds to a
case in which additional extinction trials are given beyond the point at
which temporary response probability first reaches zero. The results of
this procedure will clearly depend upon the conditioning history.
Consider, for example, the situation illustrated in the top row of Fig. 1.
If extinction were begun immediately following the conditioning period,
then we would expect extinction below zero to have little effect, for at
the end of the first extinction period the set S would be exhausted of
conditioned elements and there would be few or none in S' to fluctuate
back into S during further periods of extinction. If, however, extinction
began long enough after the end of the acquisition period so that an
appreciable number of conditioned elements were in S' during the first
extinction period, the additional extinction would further reduce the
total number of conditioned elements and thus increase the amount of
training that would be required for reconditioning. If conditioning
extended over more than one period, then there would be conditioned
elements in S' at the end of conditioning, and similar effects of
extinction below zero would be expected even if extinction began
immediately after the last conditioning period.

Distribution of practice. In gen- 
eral, amount of spontaneous regression 
should vary inversely with duration of 
the intertrial interval during condition- 
ing, and spontaneous recovery should 
vary inversely with duration of the in- 
tertrial interval during extinction. In 
each case, the length of the intertrial 
interval will determine the extent to 
which the stimulating situation can 
change between trials, and thus the 
proportion of the elements in the stimulus
population S* which will be sampled
during a given number of trials.
These relationships will be treated in 
more detail in a forthcoming paper (7). 


MATHEMATICAL DEVELOPMENT OF 
FLUCTUATION THEORY 


Stimulus fluctuation model. Let the probability that any given element of
a total set S* is in the available set S at time t be represented by f(t),
the probability that an element in S escapes into the unavailable set S'
during a time interval Δt by j, and the probability that an element in S'
enters S during an interval Δt by j'. Then by elementary probability
theory we have for the probability that an element is in S at the end of
the (t + 1)st interval Δt following an experimental period:

$$f(t+1) = f(t)\,(1 - j) + [1 - f(t)]\,j'.$$

This difference equation can be solved by standard methods (2, 12) to
yield a formula for f(t) in terms of t and the parameters:

$$f(t) = \frac{j'}{j + j'} - \left[\frac{j'}{j + j'} - f(0)\right](1 - j - j')^{t} = J - [J - f(0)]\,a^{t}, \tag{1}$$

where f(0) is the initial value of f(t); J represents the fraction
j'/(j + j'); and a represents the quantity (1 - j - j'). Since a is
bounded between -1 and +1 by the definition of j and j', the probability
that any element is in S will settle down to the constant value J after a
sufficiently long interval of time, and the total numbers of elements in S
and S' will stabilize at mean values N and N', respectively, which satisfy
the relation

$$N = J\,(N + N'). \tag{2}$$
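
As a check on Equation 1, the difference equation may be iterated
directly and compared with the closed form. This is a minimal sketch; the
function names and the parameter values used in the comparison are
illustrative assumptions, not quantities taken from the text.

```python
def f_iterated(t, f0, j, j_prime):
    """Apply f(t+1) = f(t)(1 - j) + [1 - f(t)] j' a total of t times."""
    f = f0
    for _ in range(t):
        f = f * (1.0 - j) + (1.0 - f) * j_prime
    return f


def f_closed(t, f0, j, j_prime):
    """Closed form of Equation 1: f(t) = J - [J - f(0)] a**t."""
    big_j = j_prime / (j + j_prime)   # equilibrium value J
    a = 1.0 - j - j_prime             # per-interval persistence factor
    return big_j - (big_j - f0) * a ** t


# Largest disagreement over the first 50 intervals for assumed j, j'.
max_gap = max(
    abs(f_iterated(t, 1.0, 0.1, 0.2) - f_closed(t, 1.0, 0.1, 0.2))
    for t in range(50)
)
```

The two computations agree to within rounding error, and for large t both
approach J regardless of the starting value f(0).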


Spontaneous recovery and regression. 
Curves of spontaneous recovery and re- 
gression can now be obtained by ap- 
propriate application of Equation 1. 


5 For simplicity, it has been assumed in this
paper that all of the elements in S* have the
same values of j and j'. In dealing with
some situations it might be more reasonable 
to assume that different parameter values are 
associated with different elements. For ex- 
ample, data obtained by Homme (11) sug- 
gest that in the Skinner box a portion of the 
elements should be regarded as fixed and al- 
ways available while the remainder fluctuate. 
Application of an analytic method described 
elsewhere (8) shows that conclusions in the 
general case will differ only quantitatively 
from those given in this paper.

Let us designate by p(t) and p'(t) the proportions of conditioned
elements, and therefore the response probabilities, in S and S'
respectively at time t following an experimental period. The set of
conditioned elements in S at time t will come in part from the conditioned
elements, p(0)N in number, that were in S at the end of the experimental
period, and in part from the conditioned elements, p'(0)N' in number, that
were in S'. The probabilities of finding elements from these two sources
in S at time t are obtained from Equation 1 by setting f(0) equal to 1 and
0 respectively. With these relations at hand we are ready to write the
general expression for spontaneous recovery and regression:

$$p(t) = \frac{1}{N}\Big[\,p(0)\big\{J - (J - 1)\,a^{t}\big\}\,N + p'(0)\,J\,(1 - a^{t})\,N'\Big] = p(0)\big[J - (J - 1)\,a^{t}\big] + p'(0)\,(1 - a^{t})(1 - J), \tag{3}$$

the parameters N and N' having been eliminated by means of Equation 2.

The functions illustrated by the curve families of Fig. 2 and 3 are all
special cases of Equation 3. In the upper panel of Fig. 2, p'(0) has been
set equal to 0; in the lower panel, p(0) has been set equal to 1. In the
upper panel of Fig. 3, p'(0) has been set equal to 1; in the lower panel,
p(0) has been set equal to 0.
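
The curve families just described can be generated from Equation 3 by a
few lines of computation. The sketch below is offered only as an
illustration: the values j = .1 and j' = .2, the particular parameter
levels for each family, and all function and variable names are
assumptions made for the example.

```python
def p_of_t(t, p0, p0_prime, j, j_prime):
    """Equation 3: response probability in S at time t after a period."""
    big_j = j_prime / (j + j_prime)
    a = 1.0 - j - j_prime
    from_s = p0 * (big_j - (big_j - 1.0) * a ** t)                 # started in S
    from_s_prime = p0_prime * (1.0 - a ** t) * (1.0 - big_j)       # started in S'
    return from_s + from_s_prime


def asymptote(p0, p0_prime, j, j_prime):
    """Limit of Equation 3 for large t: J p(0) + (1 - J) p'(0)."""
    big_j = j_prime / (j + j_prime)
    return big_j * p0 + (1.0 - big_j) * p0_prime


J_ESCAPE, J_ENTER = 0.1, 0.2      # assumed values of j and j'
TIMES = range(21)

# Fig. 2, upper panel: p'(0) = 0, with p(0) as the parameter.
regression_family = {
    p0: [p_of_t(t, p0, 0.0, J_ESCAPE, J_ENTER) for t in TIMES]
    for p0 in (1.0, 0.67, 0.33)
}

# Fig. 3, upper panel: p'(0) = 1, with p(0) as the parameter.
recovery_family = {
    p0: [p_of_t(t, p0, 1.0, J_ESCAPE, J_ENTER) for t in TIMES]
    for p0 in (0.0, 0.33, 0.67)
}
```

Each curve starts at p(0) and runs to the asymptote J p(0) + (1 - J) p'(0),
so that within a family the curves differ in both starting point and
limit, while the relative rate of approach, fixed by a, is common to all.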


EMPIRICAL RELEVANCE AND ADEQUACY 


General considerations. The theo- 
retical developments of the preceding 
sections present two aspects, one gen- 
eral and one specific, which are by no 
means on the same footing with regard 
to testability. It will be necessary to 
discuss separately the general concept 
of stimulus fluctuation and the specific 
mathematical model utilized for purposes
of deriving its testable consequences.

The reason why the fluctuation con- 
cept had to be incorporated into a for- 
mal theory in order to be tested was, 
of course, the difficulty of direct ob- 
servational check. Thus for the pres- 
ent this concept must be treated with 
the same reserve and even suspicion 
as any interpretation which appeals to 
unobservable events. This remoteness 
from direct observation may, however, 
represent only a transitory stage in the 
development of the theory. Relatively 
direct attacks upon certain aspects of 
the stimulus element concept are provided
by recent experiments (1, 21) in
which the sampling of stimulus popula- 
tions has been modified experimentally 
and the outcome compared with theo- 
retical expectation. Further, it should 
be noted that the idea of stimulus 
fluctuation is well grounded in physi- 
cal considerations. Surely no one would 
deny that stimulus fluctuation must oc- 
cur continuously; the only question is 
whether fluctuations are large enough 
under ordinary experimental conditions 
to yield detectable effects upon behavior.
The surmise that they are is not
a new one; the idea of fluctuating
environmental components has been used
in an explanatory sense by a number
of investigators in connection with
particular problems: e.g., by Pavlov (19)
and Skinner (22) in accounting for
perturbations in curves of conditioning
or extinction, by Guthrie (10) in
accounting for the effects of repetition,
and recently by Saltz (20) in accounting
for disinhibition and reminiscence.

Considered in isolation, the concept
of stimulus fluctuation is not even
indirectly testable; it must be incorporated
into some broader body of theory
before empirical consequences can be 
derived. In the present paper we have 
found that when this concept is taken 
in conjunction with other concepts and
assumptions common to contemporary
statistical learning theories (2, 5, 8, 
16), the result of the union is a mathe- 
matical model which yields a large num- 
ber of predictions concerning changes 
in response probability during rest in- 
tervals. Once formulated, this model 
is readily subject to experimental test. 
Its adequacy as a descriptive theory of 
spontaneous recovery and regression can 
be evaluated quite independently of the 
merits of the underlying idea of stimu- 
lus fluctuation. 

Spontaneous recovery. Space does 
not permit the detailed discussion of 
experimental studies, and we shall have 
to limit ourselves to a brief summary of 
empirical relationships derivable from 
the theory, together with appropriate 
references to the experimental litera- 
ture. To the best of my knowledge, 
the references cited include all studies 
which provide quantitative data suit- 
able for comparison with predicted 
functions. 

a. The curve of recovery is exponential in
form (3, 9, 17) with the slope independent
of the initial value (3).

b. The asymptote of recovery is inversely 
related to the degree of extinction (3, 11). 

c. The asymptote of recovery is directly 
related to the number of conditioning peri- 
ods given prior to extinction (11). 

d. The asymptote of recovery is directly 
related to the spacing of preceding condition- 
ing periods (11). 

e. Amount of recovery progressively de- 
creases during a series of successive extinc- 
tion periods (4; 13; 19, p. 61). 


It may be noted that items c and d 
represent empirical findings growing out 
of a study conducted expressly to test 
certain aspects of the theory. Many 
additional predictions derivable from 
the theory must remain unevaluated 
until appropriate experimental evidence 
becomes available, e.g., the inverse relation
between asymptote of recovery and spacing of
extinction trials or periods, and the
predictions concerning “extinction below zero”
mentioned in a previous section.

Spontaneous regression. Predictions 
concerning functional relationships be- 
tween spontaneous regression and such 
experimental variables as trial spacing 
or degree of learning parallel those 
given above for spontaneous recovery, 
but in the case of regression there are 
fewer data available for purposes of 
verification. The predicted exponential 
decrease in amount of regression as a 
function of number of preceding learn- 
ing periods has been observed in sev- 
eral studies (6, 11, 13, 14). Predic- 
tions concerning regression in relation 
to spacing of learning periods have not 
been tested in conditioning situations, 
but they seem to be in agreement with 
rather widely established empirical re- 
lationships between spacing and reten- 
tion in human learning (15, pp. 156- 
158; 18, p. 508). 

Finally, the question may be raised 
whether there are no experimental facts 
that would embarrass the present theory.
If a claim of comprehensiveness
had been made for the theory, then 
negative instances would be abundantly 
available. Under some conditions, for 
example, recovery or regression fails to 
appear at all following extinction or 
conditioning, respectively. Since, how- 
ever, we are dealing with a theory that 
is limited to effects of a single inde- 
pendent variable, stimulus fluctuation, 
instances of that sort are of no special 
significance. Like any limited theory, 
this one can be tested only in situa- 
tions where suitable measures are taken 
and where the effects of variables not 
represented in the model are either 
negligible or else quantitatively predictable.
And subject to these qualifications,
available evidence seems to be
uniformly confirmatory. The danger
of continually evading negative evidence
by ad hoc appeals to other variables
cannot be entirely obviated, but
it may be progressively reduced if we
are successful in bringing other relevant 
independent variables into the theoreti- 
cal fold by further applications of the 
analytical method illustrated here. 


SUMMARY 


In this paper we have investigated 
the possibility that certain apparently 
spontaneous behavioral changes, e.g., 
recovery from extinction, may be ac- 
counted for in terms of random fluctua- 
tion in stimulus conditions. Taken in 
isolation, the concept of random stimu- 
lus fluctuation has proved untestable, 
but when incorporated into a model it 
has led to quantitative descriptions of 
a variety of already established empirical
relationships concerning spontaneous
recovery and regression and to
the determination of some new ones. 
A forthcoming paper in which the same 
model is applied to the problem of
distribution of practice will provide
further evaluation of its scope and
usefulness in the interpretation of learning
phenomena.


REFERENCES 


1. BURKE, C. J., ESTES, W. K., & HELLYER, S. Rate of verbal conditioning in relation to stimulus variability. J. exp. Psychol., 1954, 48, 153-161.

2. BUSH, R. R., & MOSTELLER, F. Stochastic models for learning. New York: Wiley, in press.

3. ELLSON, D. G. Quantitative studies of the interaction of simple habits: I. Recovery from specific and generalized effects of extinction. J. exp. Psychol., 1938, 23, 339-358.

4. ELLSON, D. G. Successive extinctions of a bar-pressing response in rats. J. gen. Psychol., 1940, 23, 283-288.

5. ESTES, W. K. Toward a statistical theory of learning. Psychol. Rev., 1950, 57, 94-107.

6. ESTES, W. K. Effects of competing reactions on the conditioning curve for bar pressing. J. exp. Psychol., 1950, 40, 200-205.

7. ESTES, W. K. Statistical theory of distributional phenomena in learning. Psychol. Rev., in press.

8. ESTES, W. K., & BURKE, C. J. A theory of stimulus variability in learning. Psychol. Rev., 1953, 60, 276-286.

9. GRAHAM, C. H., & GAGNÉ, R. M. The acquisition, extinction, and spontaneous recovery of a conditioned operant response. J. exp. Psychol., 1940, 26, 251-280.

10. GUTHRIE, E. R. The psychology of learning. New York: Harper, 1952.

11. HOMME, L. E. Spontaneous recovery from extinction in relation to number of reinforcements, spacing of acquisition, and duration of initial extinction period. Unpublished Ph.D. thesis, Indiana Univer., 1953.

12. JORDAN, C. Calculus of finite differences. New York: Chelsea, 1950.

13. LAUER, D. W., & ESTES, W. K. Successive acquisitions and extinctions of a jumping habit in relation to schedule of reinforcements. J. comp. physiol. Psychol., 1955, 48, 8-13.

14. LAUER, D. W., & ESTES, W. K. Rate of learning successive discrimination reversals in relation to trial spacing. Amer. Psychologist, 1953, 8, 384. (Abstract)

15. McGEOCH, J. A., & IRION, A. L. The psychology of human learning. New York: Longmans, Green, 1952.

16. MILLER, G. A., & McGILL, W. J. A statistical description of verbal learning. Psychometrika, 1952, 17, 369-396.

17. MILLER, N. E., & STEVENSON, S. S. Agitated behavior of rats during experimental extinction and a curve of spontaneous recovery. J. comp. Psychol., 1936, 21, 205-231.

18. OSGOOD, C. E. Method and theory in experimental psychology. New York: Oxford Univer. Press, 1953.

19. PAVLOV, I. P. Conditioned reflexes. (Trans. by G. V. Anrep.) London: Oxford Univer. Press, 1927.

20. SALTZ, E. A single theory for reminiscence, act regression, and other phenomena. Psychol. Rev., 1953, 60, 159-171.

21. SCHOEFFLER, M. S. Probability of response to compounds of discriminated stimuli. J. exp. Psychol., 1954, 48, 323-329.

22. SKINNER, B. F. The behavior of organisms. New York: Appleton-Century-Crofts, 1938.


(Received April 18, 1954) 


A THEORY OF STIMULUS VARIABILITY IN LEARNING¹


W. K. ESTES AND C. J. BURKE 


Indiana University 


There are a number of aspects of the 
stimulating situation in learning experi- 
ments that are recognized as important 
by theorists of otherwise diverse view- 
points but which require explicit rep- 
resentation in a formal model for ef- 
fective utilization. One may find, for 
example, in the writings of Skinner, 
Hull, and Guthrie clear recognition of 
the statistical character of the stimulus 
concept. All conceive a stimulating 
situation as made up of many compo- 
nents which vary more or less inde- 
pendently. From this locus of agree- 
ment, strategies diverge. Skinner (17) 
incorporates the notion of variability 
into his stimulus-class concept, but 
makes little use of it in treating data. 
Hull states the concept of multiple com- 
ponents explicitly (13) but proceeds to 
write postulates concerning the condi- 
tions of learning in terms of single 
components, leaving a gap between the 
formal theory and experimentally de- 
fined variables. Guthrie (11) gives 
verbal interpretations of various phe- 
nomena, e.g., effects of repetition, in 
terms of stimulus variability; these in- 
terpretations generally appear plausi- 
ble but they have not gained wide ac- 
ceptance among investigators of learn- 
ing, possibly because Guthrie’s assump- 
tions have not been formalized in a 
way that would make them easily used by others. Statistical theories of
learning differ from Hull in making stimulus variability a central concept
to be used for explanatory purposes rather than treating it as a source of
error, and they go beyond Skinner and Guthrie in attempting to construct a
formalism that will permit unambiguous statements of assumptions about
stimulus variables and rigorous derivation of the consequences of these
assumptions.

1 This paper is based upon a paper reported by the writers at the Boston
meetings of the Institute of Mathematical Statistics in December 1951. The
writers' thinking along these and related lines has been stimulated and
their research has been facilitated by participation in an interuniversity
seminar in mathematical models for behavior theory which met at Tufts
College during the summer of 1951 and was sponsored by SSRC.

This article appeared in Psychol. Rev., 1953, 60, 276-286. Reprinted with
permission.

It has been shown in a previous paper (7) that several quantitative
aspects of learning, for example the exponential curve of habit growth
regularly obtained in certain conditioning experiments, follow as
consequences of statistical assumptions and need not be accounted for by
independent postulates. All of the derivations were carried out, however,
under the simplifying assumption that all components of a stimulating
situation are equally likely to occur on any trial. By removing that
restriction, we are now in a position to generalize and extend the theory
in several respects. It will be possible to show that regardless of
whether assumptions as to the necessary conditions for learning are drawn
from contiguity theories or from reinforcement theories, certain
characteristics of the learning process are invariant with respect to
stimulus properties while other characteristics depend in specific ways
upon the nature of the stimulating situation.


THE GENERALIZED SET MODEL:
ASSUMPTIONS AND NOTATION


The exposure of an organism to a stimulating situation determines a set
of events referred to collectively as stimulation. These events constitute
the data of the various special disciplines concerned with vision,
audition, etc. We wish to formulate our model of the stimulus situation so
that information from these special disciplines can be fed into the
theory, although utilization of that information will depend upon the
demands of learning experiments.
For the present we shall make only 
the following very general assumptions 
about the stimulating situation: (a) 
The effect of a stimulus situation upon 
an organism may be regarded as made 
up of many component events. (b) 
When a situation is repeated on a series 
of trials, any one of these component 
stimulus events may occur on some 
trials and fail to occur on others; as a 
first approximation, at least, the rela- 
tive frequencies of the various stimulus 
events when the same situation (as de- 
fined experimentally) occurs on a series 
of trials, may be represented by independent probabilities. We formulate
these assumptions conceptually as follows:

(a) With any given organism we associate a set S* of N* elements. The N*
elements of S* are to represent all of the stimulus events that can occur
in that organism in any situation whatever, with each of these possible
events corresponding to an element of the set. (b) For any reproducible
stimulating situation we assume a distribution of values of the parameter
θ; we represent by θ_i the probability that the stimulus event
corresponding to the ith element of S* occurs on any given trial.

It is assumed that any change in the situation (and we shall attempt to
deal only with controlled changes corresponding to manipulations of
experimental variables) determines a new distribution of values of the
θ_i. By repeating the “same” situation, we mean the same as described in
physical terms, and we recognize that, strictly speaking, repetition of
the same situation refers to an idealized state of affairs which can be
approached by increasing experimental control but possibly never
completely realized.

2 In the sequel, various sets will be designated by the letter S,
accompanied by appropriate subscripts and superscripts. The letter N, with
the same arrangement of subscripts and superscripts, always denotes the
size of the set.

It is recognized that some sources of 
stimulation are internal to the organ- 
ism. This means that in order to have 
a reproducible situation in a learning 
experiment it is necessary to control 
the maintenance schedule of the or- 
ganism and also activities immediately 
preceding the trial. In the present 
paper we shall not use the term “trial” 
in a sufficiently extended sense to necessitate
including in the θ distribution
movement-produced-stimulation arising 
from the responses occurring on the 
trial. 

We have noted that the behavior on 
a given trial is assumed to be a function 
of the stimulus elements which are 
sampled on that trial. If in a given 
situation certain elements of S* have a 
probability θ = 0 of being sampled,
those elements have a negligible effect
upon the behavior in that situation.
For this reason we often represent a
specific situation by means of a reduced
set S. An element of S* is in S
if and only if it has a non-zero value of
θ in the given situation. These sets
are represented in Fig. 1. In this connection,
we must note that a probability
of zero for a given event does
not mean that the event can never
occur “accidentally”; this probability
has the weaker meaning that the relative
frequency of occurrence of the
event is zero in the long run. For a
more detailed explication of this point
the reader is referred to Cramér (5). 

It should be clearly understood that the probability, θ, that a given
stimulus event occurs on a trial may depend upon many different
environmental events. For example, a stimulus event associated with visual
stimulation may depend for its probability upon several different light
sources in the environment. Suppose that for a given stimulus element, the
associated probability θ in a given situation depends only upon two
separately manipulable components of the environment, a and b, and that
the probabilities of the element's being drawn if only a or b alone were
present are θ_a and θ_b respectively. Then the probability attached to
this element in the situation with both components present will be

$$\theta = \theta_a + \theta_b - \theta_a\,\theta_b.$$
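
The combination rule above is the probability that at least one of two
independently acting components produces the stimulus event. As a small
sketch, it can be computed as one minus the product of the separate
probabilities of non-occurrence. The function name is hypothetical, and
the extension to more than two components is an assumption of this
example (independence of all components), not a statement made in the
text.

```python
def combined_theta(*thetas):
    """Probability that an element is drawn when several components act.

    Each argument is the probability that the element would be drawn if
    only that component were present. Components are assumed to act
    independently, so the element is missed only if every component
    fails to produce it.
    """
    miss = 1.0
    for theta in thetas:
        if not 0.0 <= theta <= 1.0:
            raise ValueError("each theta must lie between 0 and 1")
        miss *= 1.0 - theta
    return 1.0 - miss


theta_a, theta_b = 0.3, 0.5                     # illustrative values only
theta_both = combined_theta(theta_a, theta_b)   # theta_a + theta_b - theta_a*theta_b
```

For two components this expression is algebraically identical to
θ_a + θ_b − θ_a θ_b.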


Fig. 1. [...] stimulus elements, the stimulus space S*, the reduced set S
of elements with non-zero θ values for a given stimulating situation, and
the re[...]


THE RESPONSE MODEL


The response model formulated in a 
previous paper (7) will be used here
without any important modification.
We shall deal only with the simple case
of two mutually exclusive and exhaustive
response classes. The response
class being recorded in a given situation
will be designated A and the
complementary class, Ā. The dependent
variable of the theory is the probability
that the response occurring on a given
trial is a member of class A. It is
recognized that in a learning experiment
the behaviors available to the organism
may be classified in many different
ways, depending upon the interests of
the experimenter. The response class
selected for investigation may be
anything from the simplest reflex to a
complex chain of behaviors involving many
different groups of effectors. Adequate
treatment of all levels of response 
specification would require the formula- 
tion of a model for the response space 
and will not be attempted in the pres- 
ent paper. Preliminary investigation 
of this problem leads us to believe that 
when a response model is elaborated, 
the theory developed in this paper will 
be found to hold rigorously for the most 
elementary response components and to 
a first approximation for simple re- 
sponse classes that do not involve 
chaining. For experimental verifica- 
tion of the present theory we shall 
look to experiments involving response 
classes no more complex than flexing a 
limb, depressing a bar, or moving a key. 


CONDITIONAL RELATIONS AND 
RESPONSE PROBABILITY 


We assume that the behavior of an 
organism on any trial is a function, not 
of the entire population of possible 
stimulus events, but only of those 
stimulus events which occur on that 
trial; further, when learning takes 
place, it involves a change in the de- 
pendency of the response upon the 
stimulus events which have occurred on 
the given trial. 

Conditional relations, or for brevity, 
connections, between response classes 
and stimulus elements are defined as in 
other papers on statistical learning 
theory (3, 7). The response classes 
A and $\bar{A}$ define a partition of S* into
two subsets $S_A^*$ and $S_{\bar{A}}^*$. Elements
in $S_A^*$ are said to be “connected to”
or “conditioned to” response A; those
in $S_{\bar{A}}^*$ to response $\bar{A}$. The concept of
a partition implies specifically that
every element of S* must be connected
either to A or to $\bar{A}$ but that no element
may be connected to both simultaneously.
Various features of the model
are illustrated in Fig. 1.


3 The argument of this section could as well
be given in terms of the set S as of S*, defin-

For each element in S* we define a
quantity $F_i(n)$ representing the probability
that the element in question is
connected to response A, i.e., is in the
subset $S_A^*$, at the end of trial n. The
mean value of $F_i(n)$ over S* is, then,
simply the expected proportion of elements
connected to A, and if all of the
$\theta_i$ were equal, it would be natural to
define this proportion as the probability,
p(n), that response A occurs on
trial n + 1. In the general case, however,
not all of the $\theta_i$ are equal and the
contribution of each element should be
weighted by its probability of occurrence,
giving

(1)  $$p(n) = \frac{\sum_i \theta_i F_i(n)}{\sum_i \theta_i} = \frac{1}{N\bar{\theta}} \sum_i \theta_i F_i(n).$$

It will be seen that in the equal $\theta$
case, expression (1) reduces to

(2)  $$p(n) = \frac{1}{N} \sum_i F_i(n) = E(F_i(n))$$


which, except for changes in notation, is 
the definition used in previous papers 
(6, 7). 
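
As a minimal sketch of expressions (1) and (2), the following Python fragment computes p(n) as the $\theta$-weighted mean of the $F_i(n)$. The function name `response_probability` and the three-element sample values are illustrative assumptions, not part of the theory.

```python
def response_probability(thetas, F):
    """Expression (1): theta-weighted mean of the connection probabilities F_i(n)."""
    if len(thetas) != len(F):
        raise ValueError("thetas and F must have the same length")
    total = sum(thetas)
    if total <= 0:
        raise ValueError("at least one theta must be positive")
    return sum(t * f for t, f in zip(thetas, F)) / total


# Hypothetical values for three elements (illustrative only).
F = [1.0, 0.5, 0.0]
unequal = response_probability([0.3, 0.2, 0.1], F)  # elements weighted by theta_i
equal = response_probability([0.2, 0.2, 0.2], F)    # reduces to the plain mean, expression (2)
print(unequal, equal)
```

With equal weights the result is the unweighted mean of the $F_i(n)$, as expression (2) requires; with unequal weights, elements that occur more often count for more.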

The quantity p is, then, another of
the principal constructs of the theory. 
It is referred to as a probability, firstly 
because we assume explicitly that quan- 
tities p are to be manipulated mathe- 
matically in accordance with the axioms 
of probability theory, and secondly be- 
cause in some situations p can be given 
a frequency interpretation. In any 
situation where a sequence of responses 
can be obtained under conditions of 
negligible learning and independent 
trials (as at the asymptote of a simple 
learning experiment carried out with dis- 
crete, well-spaced trials) the numerical 
value of p is taken as the average relative
frequency of response A. For all
situations the construct p is assumed to 


ing $S_A$ and $S_{\bar{A}}$ as the partition of S imposed
by the response classes A and $\bar{A}$.

correspond to a parameter of the behavior
system, and we do not cease to
speak of this as a probability in the 
case of a situation where it cannot be 
evaluated as a relative frequency. It 
has been shown in a previous paper (7) 
that p can be related in a simple man- 
ner to rate or latency of responding in 
many situations; thus in all applications
of the theory, p is evaluated in
accordance with the rules prescribed by 
the theory, either from frequency data 
or from other appropriate data, and 
once evaluated is treated for all mathe- 
matical purposes as a probability. 


REPRESENTATION OF LEARNING 
PROCESSES 


In order to account for the gradual 
course of learning in most situations, 
a number of the earlier quantitative 
theories, e.g., those of Hull (13), Gul- 
liksen and Wolfle (10), Thurstone 
(18) have assumed that individual con- 
nections are formed gradually over a 
series of learning trials. Once we adopt 
a statistical view of the stimulating 
situation, however, it can be shown 
rigorously that not only the gradual 
course of learning but the form of the 
typical learning curve can be accounted 
for in terms of probability considera- 
tions even if we assume that connec- 
tions are formed on an all-or-none 
basis. This being the case, there seems 
to be no evidence whatsoever that 
would require a postulate of gradual 
formation of individual connections. 
Psychologically an all-or-none assump- 
tion has the advantage of enabling us 
to account readily for the fact that 
learning is sudden in some situations 
and gradual in others; mathematically, 
it has the advantage of great simplicity. 
For these reasons, recent statistical 
theories of learning have adopted some 
form of the all-or-none assumption (3,
7, 15).


Under an all-or-none theory, we must
specify the probabilities that any stimulus
element that is sampled on a given
trial will become connected to A or to
$\bar{A}$. For convenience in exposition, we
shall limit ourselves in this paper to
the simplest special case, i.e., a homoge- 
neous series of discrete trials with 
probability equal to one that all ele- 
ments occurring on a trial become con- 
nected to response A. 

We begin by asking what can be said 
about the course of learning during a 
sequence of trials regardless of the dis- 
tribution of stimulus events. It will 
be shown that our general assumptions 
define a family of mathematical opera- 
tors describing learning during any pre- 
scribed sequence of trials, the member 
of the family applicable in a given situation
depending upon the $\theta$ distribution.
We shall first inquire into the charac- 
teristics common to all members of a 
family, and then into the conditions 
under which the operators can be approximated
adequately by the relatively
simple functions that have been found 
convenient for representing learning 
data in previous work. 

Let us consider the course of learn- 
ing during a sequence of trials in the 
simplified situation. Each trial in the 
series is to begin with the presentation 
of a certain stimulus complex. This
situation defines a distribution of $\theta$ over
S* so that each element in S* has some
probability, $\theta_i$, of occurring on any trial,
and we represent by S the subset of
elements with non-zero $\theta$ values; any
element that occurs on a trial becomes 
connected to A (or remains connected 
to A if it has been drawn on a previous 
trial). For concreteness the reader 
might think of a simple conditioning 
experiment with the CS preceding the 
US by an optimal interval, and with 
conditions arranged so that the UR is 
evoked on each trial and decremental 
factors are negligible; the situation represented
by S is that obtaining from the
onset of the CS to the onset of the US,
and the response probability will re- 
fer to the probability of A in this situa- 
tion. The number of elements in S will 
be designated by N. For simplicity we 
shall suppose in the following deriva- 
tions that none of the elements in S are 
connected to A at the beginning of the 
experiment. This means that the learning
curves obtained all begin with $N_A$
and p equal to zero. No loss of gen- 
erality is involved in this simplifica- 
tion; our results may easily be extended 
to the case of any arbitrary initial 
condition. 

The $i$th element in S will still remain
in $S_{\bar{A}}$ after the $n$th trial if and only if
it is not sampled on any of the first n
trials; the likelihood that this occurs
is $(1 - \theta_i)^n$. Hence, if $F_i(n)$ represents
the expected probability that this
element is connected to A after the $n$th
trial, we obtain:

(3)  $$F_i(n) = 1 - (1 - \theta_i)^n.$$

The expected number of elements in S
connected to A after the $n$th trial,
$E[N_A(n)]$, will be the sum of these expected
contributions from individual
elements:

(4)  $$E[N_A(n)] = \sum_i F_i(n) = \sum_i \left[1 - (1 - \theta_i)^n\right] = N - \sum_i (1 - \theta_i)^n.$$


We are now in a position to express p, the probability
of response A, as a function of the number of trials in
this situation. Substituting for the term
$F_i(n)$ of equation (1) its equivalent
from equation (3), we obtain the relation

(5)  $$p(n) = \frac{\sum_i \theta_i \left[1 - (1 - \theta_i)^n\right]}{N\bar{\theta}} = 1 - \frac{1}{N\bar{\theta}} \sum_i \theta_i (1 - \theta_i)^n.$$

Equation (5) defines a family of
learning curves, one for each possible 
$\theta$ distribution, and it has a number of
simple properties that are independent
of the distribution of the $\theta_i$. It can
easily be verified by substitution that
there is a fixed point at p = 1, and this
will be the asymptote approached by
the curve of p(n) vs. n as n increases
over all bounds. Members of the
family will be monotonically increasing,
negatively accelerated curves, approaching
a simple negative growth function
as the $\theta_i$ tend toward equality. If all
of the $\theta_i$ are equal to $\theta$, equation (5)
reduces to

(6)  $$p(n) = 1 - (1 - \theta)^n$$


which, except for a change in notation, 
is the same function derived previously 
(7) for the equal $\theta$ case⁴ and corresponds
to the linear operator used by
Bush and Mosteller (2) for situations 
where no decremental factor is in- 
volved. In mathematical form, equa- 
tion (6) is the same as Hull's well- 
known expression for growth of habit 
strength, but the function does not 
have the same relation to observed 
probability of responding in Hull's 
theory as in the present formulation. 
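
The all-or-none process behind equation (5) can be checked by direct simulation. The sketch below assumes a hypothetical population of twenty elements (ten with $\theta$ = 0.3, ten with $\theta$ = 0.1), samples each element independently on every trial, connects every sampled element to A, and compares the averaged response probability with the closed form. The function names, population, and run count are illustrative choices only.

```python
import random


def exact_curve(thetas, n):
    """Equation (5): p(n) = 1 - sum(theta_i * (1 - theta_i)**n) / sum(theta_i)."""
    return 1.0 - sum(t * (1.0 - t) ** n for t in thetas) / sum(thetas)


def simulate(thetas, n_trials, n_runs, seed=0):
    """Monte Carlo estimate of p(n) under all-or-none conditioning.

    On every trial each element is sampled independently with probability
    theta_i, and every sampled element becomes connected to A.
    """
    rng = random.Random(seed)
    total_theta = sum(thetas)
    sums = [0.0] * (n_trials + 1)
    for _ in range(n_runs):
        connected = [False] * len(thetas)       # no element starts connected to A
        for n in range(n_trials + 1):
            weight = sum(t for t, c in zip(thetas, connected) if c)
            sums[n] += weight / total_theta     # expression (1) after n trials
            for i, t in enumerate(thetas):      # draw the sample for the next trial
                if rng.random() < t:
                    connected[i] = True
    return [s / n_runs for s in sums]


thetas = [0.3] * 10 + [0.1] * 10                # hypothetical population
estimate = simulate(thetas, n_trials=20, n_runs=2000)
worst = max(abs(estimate[n] - exact_curve(thetas, n)) for n in range(21))
print(f"largest gap between simulation and equation (5): {worst:.4f}")
```

When all the $\theta_i$ passed to `exact_curve` are equal, the function returns the value given by equation (6).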

Except where the distribution function
of the $\theta_i$ either is known, or can
be assumed on theoretical grounds to
be approximated by some simple expression,
equation (5) will not be convenient
to work with. In practice we
are apt to assume equal $\theta_i$ and utilize
equation (6) to describe experimental 
data. The nature of the error of ap- 
proximation involved in doing this can 
be stated generally. Immediately after 
the first trial, the curve for the general 
case must lie above the curve for the 


equal $\theta$ case; the difference between the
two curves increases for a few trials,
then decreases until they cross (in constructing
hypothetical $\theta$ distributions of
diverse forms we have usually found
this crossing in the neighborhood of
the fourth to eighth trial); after crossing,
the curves diverge to a smaller extent
than before, then come together as
both go to the same asymptote at
p = 1. It can be proved that the
curves for the general and special case
cross exactly once as n goes from one
to infinity. We cannot make any general
statement about the maximum error
involved in approximating expression
(5) with expression (6), but after
studying a number of special cases, we
are inclined to believe that the error
introduced by the approximation will
be too small to be readily detectable
experimentally for most simple learning
situations that do not involve compounding
of stimuli.

4 This is essentially the same function developed
for the equal $\theta$ case in a previous
paper (7); the terms $\theta$ and n of equation (6)
correspond to the terms q = s/S, and T of
that paper.

FIG. 2. Response probability, in S, as a function of number of trials for the numerical
examples presented in the text. The solid curve is the exact solution for a population of elements,
half of which have $\theta$ = 0.1 and half $\theta$ = 0.3. The dashed curve describes the equal $\theta$
approximation with $\bar{\theta}$ = 0.2. Initially no elements of S are conditioned to A.

The development of equations (5)
and (6) has necessarily been given in
rather general terms, and it may be
helpful to illustrate some of the considerations
involved by means of a simple
numerical example. Imagine that
we are dealing with some particular
conditioning experiment in which the
CS can be represented by a set S, composed
of two subsets of stimulus elements,
$S_1$ and $S_2$, of the sizes $N_1 = N_2 = N/2$,
where N is the number of elements
in S. Assume that for all elements
in $S_1$ the probability of being
drawn on any trial is $\theta_1 = 0.3$ and for
those in $S_2$, $\theta_2 = 0.1$. Now we wish to
compute the predicted learning curve
during a series of trials on which A responses
are reinforced, assuming that
we begin with all elements connected to
$\bar{A}$. Equation (5) becomes

$$p(n) = 1 - \frac{1}{0.2N}\left[N_1(0.3)(1 - 0.3)^n + N_2(0.1)(1 - 0.1)^n\right] = 1 - \frac{1}{0.4}\left[0.3(0.7)^n + 0.1(0.9)^n\right].$$


Plotting numerical values computed
from this equation, we obtain the solid
curve given in Fig. 2.

Now let us approach the same problem,
but supposing this time that we
know nothing about the different $\theta$
values in the subsets $S_1$ and $S_2$ and are
given only that $\bar{\theta} = 0.2$. We now obtain
predicted learning curves under
the equal $\theta$ approximation. Equation
(6) becomes:

$$p(n) = 1 - (1 - 0.2)^n$$


and numerical values computed from 
this yield the dashed curve of Fig. 2. 
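
The two curves of the numerical example can be tabulated with a few lines of Python. This is only a sketch: the function names `exact` and `approx`, the 50-trial range, and the derived quantities `crossing` and `largest_gap` are illustrative choices; the parameter values (0.3, 0.1, and 0.2) are those of the example in the text.

```python
def exact(n):
    """Exact curve for the two-subset example: theta = 0.3 and 0.1 in equal numbers."""
    return 1.0 - (0.3 * 0.7 ** n + 0.1 * 0.9 ** n) / 0.4


def approx(n):
    """Equal-theta approximation, equation (6), with mean theta = 0.2."""
    return 1.0 - 0.8 ** n


rows = [(n, exact(n), approx(n)) for n in range(0, 51)]
# First trial at which the exact curve lies below the approximation.
crossing = next(n for n, e, a in rows if n > 0 and e < a)
largest_gap = max(abs(e - a) for _, e, a in rows)

for n, e, a in rows[:13]:
    print(f"n={n:2d}  exact={e:.3f}  approx={a:.3f}")
print("exact curve first falls below the approximation at trial", crossing)
print(f"largest gap over 50 trials: {largest_gap:.3f}")
```

The table makes it easy to see where the exact curve runs above the approximation, where the two cross, and how small the gap is throughout.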

Inspection of Fig. 2 shows that the 
exact treatment leads to higher values 
of p(n) on the early trials but to lower 
values on the later trials, the difference 
becoming negligible for large n. The
reason, in brief, for the steeper curvature
of the exact curve is that elements
with high $\theta$ values are likely to be
drawn, and therefore conditioned to A,
earlier in the learning process than elements
with low $\theta$ values, and then because
they will tend to recur frequently
in successive samples, to lead to relatively
high values of p. During the
late stages of learning, elements with
low $\theta$ values that have not been drawn
on the early trials will contribute more
unconnected elements per trial than
would be appearing at the same stage
with an equal $\theta$ distribution and will
depress the value of p below the curve
for the equal $\theta$ approximation.

It should be emphasized that the 
generality of the present approach to 
learning theory lies in the concepts in- 
troduced and the methods developed 
for operating with them, not in the 
particular equations derived. Equa- 
tion (5), for example, can be expected 
to apply only to an extremely narrow 
class of learning experiments. On the 
other hand, the methods utilized in de- 
riving equation (5) are applicable to a 
wide variety of situations. For the in- 
terest of the experimentally oriented 
reader, we will indicate briefly a few of 
the most obvious extensions of the
theory developed above, limiting ourselves
to the equal $\theta$ case.

As written, equation (6) represents 
the predicted course of conditioning for 
a single organism with an initial re- 
sponse probability of zero. We can 
allow for the possibility that an experiment
may begin at some value of p(0)
other than zero by rewriting (6) in the
more general form

(7)  $$p(n) = 1 - [1 - p(0)](1 - \theta)^n$$


which has the same form as (6) ex- 
cept for the initial value. 

If we wish to consider the mean 
course of conditioning in a group of m 
organisms with like values of $\theta$ but
varying initial response probabilities,
we need simply sum equation (7) over
the group and divide by m, obtaining

(8)  $$\bar{p}(n) = \frac{1}{m} \sum_{j=1}^{m} p_j(n) = 1 - [1 - \bar{p}(0)](1 - \theta)^n.$$


The standard deviation of p(n) un- 
der these circumstances is simply 


(9)  $$\sigma_p(n) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left[p_j(n) - \bar{p}(n)\right]^2} = (1 - \theta)^n \sigma_p(0)$$

where $\sigma_p(0)$ is the dispersion of the
initial p values for the group. Variability
around the mean learning curve
decreases to zero in a simple manner 
as learning progresses. 
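
Equations (8) and (9) can be verified numerically for a small hypothetical group. In the sketch below the value of $\theta$, the five initial probabilities, and the function name `acquisition` are assumptions chosen for illustration; the dispersion is computed as a population standard deviation (division by m).

```python
import statistics


def acquisition(p0, theta, n):
    """Equation (7): p(n) = 1 - (1 - p(0)) * (1 - theta)**n."""
    return 1.0 - (1.0 - p0) * (1.0 - theta) ** n


theta = 0.2                              # hypothetical common theta
initial = [0.0, 0.1, 0.25, 0.4, 0.6]     # hypothetical p(0) for m = 5 organisms

for n in (0, 1, 5, 10):
    values = [acquisition(p0, theta, n) for p0 in initial]
    mean_direct = statistics.fmean(values)
    sd_direct = statistics.pstdev(values)
    mean_formula = acquisition(statistics.fmean(initial), theta, n)   # equation (8)
    sd_formula = (1.0 - theta) ** n * statistics.pstdev(initial)      # equation (9)
    print(n, round(mean_direct, 4), round(mean_formula, 4),
          round(sd_direct, 4), round(sd_formula, 4))
```

The directly computed mean and dispersion agree with the closed forms at every n, and the dispersion shrinks by the factor $(1 - \theta)$ on each trial.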

The treatment of counter-condition- 
ing, i.e., extinguishing one response by 
giving uniform reinforcement to a com- 
peting response, follows automatically 
from our account of the acquisition 
process. Returning to equation (6) 
and recalling that the probabilities of A
and $\bar{A}$ must always sum to unity, we
note that while response A undergoes
conditioning in accordance with (6),
response $\bar{A}$ must undergo extinction in
accordance with the function

$$p_{\bar{A}}(n) = 1 - p_A(n) = (1 - \theta)^n.$$

If, then, we begin with any arbitrary
p(0) and arrange conditions so that $\bar{A}$
is evoked and conditioned to all elements
drawn on each trial, the extinction
of response A will be given by
the simple decay function

(10)  $$p(n) = p(0)(1 - \theta)^n.$$


Again the mean and standard deviation 
of p(n) can easily be computed for a 
group of organisms with like values of 
$\theta$ but varying values of p(0):

(11)  $$\bar{p}(n) = \bar{p}(0)(1 - \theta)^n$$
(12)  $$\sigma_p(n) = (1 - \theta)^n \sigma_p(0).$$


As in the case of acquisition, variability 
around the mean curve decreases to 
zero in a simple manner over a series of 
trials. 

Since variability due to variation in 
p(0) is reduced during both conditioning
and counter-conditioning, it will be
seen that in general we should expect 
less variability around a curve of re- 
learning than around a curve of original 
learning for a given group of subjects. 
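
The claim about relearning can be illustrated by carrying a hypothetical group through original learning and counter-conditioning and comparing the dispersion of response probabilities at the two starting points. The trial counts, $\theta$, the initial values, and the helper names below are all assumptions made for the sake of the sketch, which covers only the equal $\theta$ case.

```python
import statistics


def condition(p, theta):
    """One reinforced trial: p moves toward 1 (one-step form of equation (7))."""
    return 1.0 - (1.0 - p) * (1.0 - theta)


def extinguish(p, theta):
    """One counter-conditioning trial: p decays toward 0 (one-step form of equation (10))."""
    return p * (1.0 - theta)


theta = 0.2                               # hypothetical common theta
group = [0.0, 0.1, 0.25, 0.4, 0.6]        # hypothetical initial probabilities
sd_original = statistics.pstdev(group)    # dispersion at the start of original learning

for _ in range(10):                       # original learning
    group = [condition(p, theta) for p in group]
for _ in range(10):                       # counter-conditioning
    group = [extinguish(p, theta) for p in group]
sd_relearning = statistics.pstdev(group)  # dispersion at the start of relearning

print(f"sd at start of original learning: {sd_original:.4f}")
print(f"sd at start of relearning:        {sd_relearning:.4f}")
```

Each trial of either kind multiplies the dispersion by $(1 - \theta)$, so the group enters relearning with much less spread than it had at the outset.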


APPLICATION OF THE STATISTICAL 
MODEL TO LEARNING
EXPERIMENTS 


Since our concern in this paper has 
been with the development of a stimu- 
lus model of considerable generality, it 
has been necessary in the interests of 
clear exposition to omit reference to 
most of the empirical material upon 
which our theoretical assumptions are 
based. The evaluation of the model 
must rest upon detailed interpretation 
of specific experimental situations. It 
is clear, however, that the statistical
model developed here cannot be tested
in isolation; only when it is taken together
with assumptions as to how
learning occurs and with rules of correspondence
between terms of the
theory and experimental variables, will 
experimental evaluation be possible. 
Limitations of space preclude a detailed 
theoretical analysis of individual learn- 
ing situations in this paper. In order 
to indicate how the model will be utilized
and to suggest some of its explanatory
potentialities we shall conclude
with a few general remarks
concerning the interpretation of learn- 
ing phenomena within the theoretical 
framework we have developed. 

Application of the model to any one 
isolated experiment will always involve 
an element of circularity, for information
about a given $\theta$ distribution must
be obtained from behavioral data. 
This circularity disappears as soon as 
data are available from a number of 
related experiments. The utility of the 
concept is expected to lie in the possi- 
bility of predicting a variety of facts 
once the parameters of the $\theta$ distribution
have been evaluated for a situation.
The methodology involved has
been illustrated on a small scale by an
experiment (6) in which the mean $\theta$
value for an operant conditioning situation
was estimated from the acquisition
curve of a bar-pressing habit and then 
utilized in predicting the course of 
acquisition of a second bar-pressing 
habit by the same animals under 
slightly modified conditions. 

When the statistical model is taken 
together with an assumption of associa- 
tion by contiguity, we have the essen- 
tials of a theory of simple learning. 
The learning functions (5), (6), and 
(10) derived above should be expected 
to provide a description of the course 
of learning in certain elementary experiments
in the areas of conditioning
and verbal association. It must be emphasized,
however, that these functions
alone will not constitute an adequate 
theory of conditioning, for a number of 
relevant variables, especially those controlling
response decrement, have not
been taken into account in our deriva- 
tions. In conditioning experiments 
where decremental factors are mini- 
mized, there is considerable evidence 
(1, 4, 9, 14, 16) that the curve of con- 
ditioning has the principal properties of 
our equation (5) and can be well approximated
by the equal $\theta$ case (7).
The fact that functions derived from 
the model can be fitted to certain em- 
pirical curves is a desirable outcome, of 
course, but cannot be regarded as pro- 
viding a very exacting test of the 
theory; probably any contemporary 
quantitative theory will manage to ac- 
complish this much. On the other 
hand, the fact that the properties of our 
learning functions follow from the sta- 
tistical nature of the stimulating situa- 
tion is of some interest; in this respect 
the structure of the present theory is 
simpler than certain others, e.g., that 
of Hull (13), which require an inde- 
pendent postulate to account for the 
form of the conditioning curve. 

It should also be noted that devia- 
tions from the exponential curve form 
may be as significant as instances of 
good fit. From the present model we 
must predict a specific kind of deviation 
when the stimulating situation contains 
elements of widely varying $\theta$ values.
If, for example, curves of conditioning
to two stimuli taken separately yield
significantly different values of $\theta$, then
the curve of conditioning to a com- 
pound of the two stimuli should be ex- 
pected to deviate further than either of 
the separate curves from a simple 
growth function. The only relevant ex- 
periment we have discovered in the 
literature is one reported by Miller 
(16); Miller’s results appear to be in 
line with this analysis, but we would 
hesitate to regard this aspect of the 
theory as substantiated until additional 
relevant data become available. 

Although we shall not develop the
argument in mathematical detail in the
present paper, it may be noted that the
statistical association theory yields certain
specific predictions concerning the
effects of past learning upon the course 
of learning in a new situation. In gen- 
eral, the increment or decrement in p 
during any trial depends to a certain 
extent upon the immediately preceding 
sequence of trials. Suppose that we 
have two identical animals each of 
which has p(n) equal, say, to 0.5 at the 
end of trial n of an experiment, and 
suppose that for each animal response 
A is reinforced on trial n+ 1. The 
histories of the two animals are pre- 
sumed to differ in that the first animal 
has arrived at p(n)=0.5 via a se- 
quence of reinforced trials while the 
second animal has arrived at this value 
via a sequence of unreinforced trials. 
On trial n + 1, the second animal will
receive the greater increment to p (except
in the equal $\theta$ case); the reason is,
in brief, that for both animals the
stimulus elements most likely to occur
on trial n + 1 are those with high $\theta$
values; for the first animal these elements
will have occurred frequently
during the immediately preceding sequence
of trials and thus will tend to
be preponderantly connected to A prior
to trial n + 1; in the case of the second
animal, the high $\theta$ elements will
have been connected to $\bar{A}$ during the
immediately preceding sequence and 
thus when A is reinforced on trial 
n + 1, the second animal will receive 
the greater increment in weight of con- 
nected elements. From this analysis it 
follows that, other things equal, a curve 
of reconditioning will approach its 
asymptote more rapidly than the curve 
of original conditioning unless extinc- 
tion has actually been carried to zero. 
How important the role of the unequal 
$\theta$ distribution will prove to be in accounting
for empirical phenomena of
relearning cannot be adequately judged
until further research has provided
means for estimating the orders of mag- 
nitude of the effects we have mentioned 
here. 
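
The two-animal argument can be made concrete with expected values. The sketch below assumes a hypothetical population with equal numbers of elements at $\theta$ = 0.3 and $\theta$ = 0.1, lets the trial count take real values so that both animals can be brought to exactly p = 0.5 (an interpolation adopted only for convenience), and then computes the gain in p from one further reinforced trial. All function names are illustrative.

```python
def weighted_p(thetas, F):
    """Response probability: theta-weighted mean of the connection probabilities."""
    return sum(t * f for t, f in zip(thetas, F)) / sum(thetas)


def solve_trials(p_of_t, target, lo=0.0, hi=200.0):
    """Bisection for the (real-valued) trial count t at which p_of_t(t) equals target."""
    rising = p_of_t(hi) > p_of_t(lo)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if (p_of_t(mid) < target) == rising:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


thetas = [0.3, 0.1]      # hypothetical: equal numbers of high- and low-theta elements


def after_reinforced(t):
    """Connection probabilities after t reinforced trials, starting with none connected to A."""
    return [1.0 - (1.0 - th) ** t for th in thetas]


def after_unreinforced(t):
    """Connection probabilities after t unreinforced trials, starting with all connected to A."""
    return [(1.0 - th) ** t for th in thetas]


def next_increment(F):
    """Gain in p from one further reinforced trial, given connection probabilities F."""
    F_next = [f + th * (1.0 - f) for th, f in zip(thetas, F)]
    return weighted_p(thetas, F_next) - weighted_p(thetas, F)


t1 = solve_trials(lambda t: weighted_p(thetas, after_reinforced(t)), 0.5)
t2 = solve_trials(lambda t: weighted_p(thetas, after_unreinforced(t)), 0.5)
gain_first = next_increment(after_reinforced(t1))
gain_second = next_increment(after_unreinforced(t2))
print(f"first animal (reached 0.5 by reinforcement):     +{gain_first:.4f}")
print(f"second animal (reached 0.5 by non-reinforcement): +{gain_second:.4f}")
```

Both animals start the test trial at the same p, yet the second gains more, because its unconnected elements are concentrated among those most likely to be sampled.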


SUMMARY 


Earlier statistical treatments of simple
associative learning have been refined
and generalized by analyzing the
stimulus concept in greater detail than 
heretofore and by taking account of the 
fact that different components of a 
stimulating situation may have differ- 
ent probabilities of affecting behavior. 

The population of stimulus events 
corresponding to an independent experimental
variable is represented in
the statistical model by a mathematical
set. The relative frequencies with
which various aspects of the stimulus
variable affect behavior in a given experiment
are represented by set operations
and functions.

The statistical model, taken together 
with an assumption of association by 
contiguity, provides a limited theory 
of certain conditioning phenomena.
Within this theory it has been possible
to distinguish aspects of the learning
process that depend upon properties of
the stimulating situation from those 
that do not. Certain general predic- 
tions from the theory concerning ac- 
quisition, extinction, and relearning, are 
compared with experimental findings. 

Salient characteristics of the model 
elaborated here are compared with 
other quantitative formulations of 
learning.


ANALYSIS OF A VERBAL CONDITIONING SITUATION IN
TERMS OF STATISTICAL LEARNING THEORY¹


W. K. ESTES AND J. H. STRAUGHAN 


Indiana University 


It is the purpose of this study to 
investigate the theoretical significance 
of a rather striking coincidence be- 
tween an experimental fact and a 
mathematical fact. The experimental 
fact has been established in the 
Humphreys-type “verbal conditioning”
situation. In this situation S is
asked to predict on each of a series of
trials whether some designated event,
e.g., the flash of a light, will occur;
this event, the analogue of the US in a
conditioning experiment, is presented
in accordance with a predetermined
schedule, usually random with some
fixed probability. Several recent investigators
(3, 5) have noted that S
tends to match his response rate to
the rate of occurrence of the predicted
event so that if the probability of the
latter is, say, .75, the mean response
curve for a group of Ss tends over a
series of trials toward an apparently
stable final level at which the event is
predicted on approximately 75% of
the trials. This behavior has seemed
puzzling to most investigators since it
does not maximize the proportion of
successful predictions and thus does
not conform to conventional law of
effect doctrine. The mathematical
fact which will concern us appeared in
the course of developing the formal
consequences of statistical association
theory (1, 2); in a simple associative
learning situation satisfying certain
conditions of symmetry, the theoretical
asymptote of response probability
turns out to be equal to the probability
of reinforcement. The reasoning
involved may be sketched briefly as
follows.

1 This research was facilitated by the senior
author’s tenure as a faculty research fellow of the
Social Science Research Council.

This article appeared in

We consider a situation in which 
each trial begins with presentation of 
a signal, or CS; following the signal, 
one or the other of two reinforcing 
stimuli, $E_1$ or $E_2$, occurs, the probability
of $E_1$ and $E_2$ during a given
series being $\pi$ and $1 - \pi$, respectively.
The behaviors available to S are categorized
into two classes, $A_1$ and $A_2$, by
experimental criteria. In the verbal
conditioning situation, $A_1$ is a prediction
that $E_1$ will occur, and $A_2$ a prediction
that $E_2$ will occur on the given
trial. We assume that the CS determines
a population, $S_c$, of stimulus
elements which is sampled by S on
each trial, the proportion $\theta$ of the
elements in this population constituting
the effective sample on any one
trial. The dependence of S’s responses
upon the stimulating situation is expressed
in the theory by defining a
conditional relationship such that each
element in $S_c$ is conditioned to (tends
to evoke) either $A_1$ or $A_2$. In order to
interpret the formal model in terms of
a verbal conditioning experiment, we
assume that when an $E_1$ occurs it
evokes from S a response belonging to
class $A_1$, i.e., one which is compatible
with the response of predicting $E_1$, but
which interferes with the response of
predicting $E_2$, and that when an $E_2$
occurs it evokes a response of class $A_2$.
Then on a trial on which $E_1$ occurs we
expect on the basis of association principles
(1) that all elements sampled
from $S_c$ on the trial will become conditioned
to $A_1$ while on an $E_2$ trial the
sample will be conditioned to $A_2$.
Now if successive trials are sufficiently
discrete so that samples from $S_c$ are
statistically independent, the probability
of an $A_1$ after Trial n, abbreviated
p(n), is defined in the model as
the proportion of elements in $S_c$ that
are conditioned to $A_1$, and similarly
for the probability of an $A_2$, $[1 - p(n)]$.
With these definitions the rule for calculating
the change in response probability
on an $E_1$ trial may be stated
formally as

$$p(n + 1) = (1 - \theta)p(n) + \theta$$  (1)

and on an $E_2$ trial as

$$p(n + 1) = (1 - \theta)p(n).$$  (2)


The genesis of these equations will be
fairly obvious. The proportion $(1 - \theta)$
of stimulus elements is not sampled,
and the status of elements that are not
sampled on a trial does not change;
the proportion $\theta$ is sampled and these
elements are all conditioned either to
$A_1$ or to $A_2$ accordingly as an $E_1$ or an
$E_2$ occurs.² Now in a random reinforcement
situation, Equation 1 will
be applicable on the proportion $\pi$ of
trials and Equation 2 on the proportion
$(1 - \pi)$; then the average probability
of $A_1$ after Trial n + 1 will be
given by the relation

$$\bar{p}(n + 1) = \pi[(1 - \theta)\bar{p}(n) + \theta] + (1 - \pi)(1 - \theta)\bar{p}(n) = (1 - \theta)\bar{p}(n) + \theta\pi.$$  (3)

2 Consequently the functions derived in this
paper should be expected to apply only to learn-


If a group of Ss begins an experiment
with the value $\bar{p}(0)$, then at the end
of Trial 1 we would have

$$\bar{p}(1) = (1 - \theta)\bar{p}(0) + \theta\pi,$$

at the end of Trial 2

$$\bar{p}(2) = (1 - \theta)[(1 - \theta)\bar{p}(0) + \theta\pi] + \theta\pi = \pi - [\pi - \bar{p}(0)](1 - \theta)^2,$$

and so on for successive trials; in
general it can be shown by induction
that at the end of the nth trial

$$\bar{p}(n) = \pi - [\pi - \bar{p}(0)](1 - \theta)^n.$$  (4)

Since $(1 - \theta)$ must be a fraction between
zero and one, it will be seen
that Equation 4 must be a negatively
accelerated curve running from the
initial value $\bar{p}(0)$ to the asymptotic
value $\pi$.
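
The derivation of Equation 4 can be checked by simulating individual subjects under random reinforcement. The following sketch applies Equation 1 or Equation 2 on each trial according to a random draw and compares the group mean with Equation 4; the parameter values ($\pi$ = .75, $\theta$ = .1, initial value .5), the number of simulated subjects, and the function names are hypothetical choices for illustration.

```python
import random


def mean_curve(pi, theta, p0, n):
    """Equation 4: expected response probability after n trials."""
    return pi - (pi - p0) * (1.0 - theta) ** n


def simulate_subject(pi, theta, p0, n_trials, rng):
    """Apply Equation 1 on E1 trials and Equation 2 on E2 trials, chosen at random."""
    p = p0
    history = [p]
    for _ in range(n_trials):
        if rng.random() < pi:              # E1 occurs: sampled elements go to A1
            p = (1.0 - theta) * p + theta
        else:                              # E2 occurs: sampled elements go to A2
            p = (1.0 - theta) * p
        history.append(p)
    return history


rng = random.Random(1)
pi, theta, p0 = 0.75, 0.1, 0.5             # hypothetical parameter values
runs = [simulate_subject(pi, theta, p0, 120, rng) for _ in range(4000)]
final_mean = sum(run[120] for run in runs) / len(runs)
for n in (0, 10, 40, 120):
    observed = sum(run[n] for run in runs) / len(runs)
    print(n, round(observed, 3), round(mean_curve(pi, theta, p0, n), 3))
```

Individual subjects fluctuate from trial to trial, but the group mean follows Equation 4 and settles near the probability of reinforcement.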

This outcome of the statistical learning
model is rather surprising at first
since it makes asymptotic response
probability depend solely upon the
probability of reinforcement. It seems,
however, to be in excellent agreement
with the experimental results of Grant,
Hake, and Hornseth (3) and Jarvik
(5). The question that interests us


ing situations which are symmetrical in the following
sense. To each response class there must
correspond a reinforcing condition which, if present
on any trial, ensures that a response belonging
to the class will terminate the trial. These
functions should, for example, be applicable to
learning of a simple left-right discrimination with
correction; but not to a left-right discrimination
without correction, to free responding in the
Skinner box, or to Pavlovian conditioning.

now is whether this agreement is to be
regarded as a remarkable coincidence 
or as a confirmation of the theory. 
We cannot estimate a confidence level 
for the latter conclusion since the ex- 
periments were not conducted specifi- 
cally to test the theory, and we cannot 
guarantee that we would be as alert to 
notice results contrary to the theory 
which might appear in the literature 
as we have been in the case of these 
decidedly positive instances. It has 
seemed to us that the least objection- 
able way out of this impasse is to 
carry out some new experiments, making
use of one of the convenient features
of a mathematical theory, namely,
that if it will generate one testable
prediction for a given experimental
situation, it can generally be made to

yield many more. In the experiment 
to be reported we have tried to set up 
a situation similar in essentials to that
used by Humphreys, Grant, and others
with an experimental design which
would permit testing of a variety of
consequences of the theory. Each S
was run through two successive series
of 120 trials in an individualized modification
of the Humphreys situation
with the schedule of $\pi$ values shown in
Table 1. Within the first series we 
will be able to compare learning rates 
and asymptotes of groups starting 
from similar initial values but exposed 
to different probabilities of reinforce- 
ment; within the second series we will 
be able to compare groups starting at 
different initial values but exposed to 
the same probabilities of reinforce- 
ment. Comparison of Group I with 
the other groups over both series will 
permit evaluation of the stability of 
learning rate ($\theta$ value) from series to
series when the $\pi$ value does or does
not change. Series IA and series IIB
will provide a comparison in which
initial response probabilities and $\pi$
values are the same but the amount of
preceding reinforcement differs. In
order to separate the effect of over-all
$\pi$ value from that of particular orders
of event occurrences, each of the three
groups indicated in Table 1 has been
subdivided into four subgroups of four
Ss each; within a treatment group, say
Group I, all subgroups have the same
$\pi$ value but each receives a separate
randomly drawn sequence of $E_1$’s and
$E_2$’s.

TABLE 1

EXPERIMENTAL DESIGN IN TERMS OF PROBABILITY
OF REINFORCEMENT ($\pi$ VALUE)
DURING EACH SERIES

Group    N     Series A    Series B
I        16    .30         .30
II       16    .50         .30
III      16    .85         .30


METHOD


Apparatus.—The experiment was run in a 
room containing a 2-ft. square signal board and 
four booths. Upon the signal board were 
mounted 12 12-v., .25-amp. light bulbs spaced 
evenly in a circle 18 in. in diameter. The bulbs
occupied the half-hour positions of a clock face. 
Only the top two lights on the board were used 
as signals in this experiment. The signal board 
was mounted vertically on a table 40 in. high 
and was about 5 ft. in front of Ss’ booths. 

The booths were made from two 30 X 60 in. 
tables, 30 in. high, placed end to end but meeting 
at an angle so that Ss sitting behind them would 
be facing almost directly toward the signal board, 
about 7 ft. in front of Ss’ eyes. Two Ss sat at 
each table. The four Ss were separated from 
one another by panels 2 ft. high and 32 in. wide. 
These panels were mounted vertically on the 
table tops so as to extend 14½ in. beyond the
edge of the table between the seated Ss. 

In each booth, 18 in. back from S’s edge of
the table, was a wooden panel 12 in. high 
mounted vertically on the table top and extend- 
ing across the width of the booth. On the side 
of this panel facing S were two reinforcing lights
of the same size as those on the signal board but 
covered by white, translucent lenses. These 
lights were directly in front of S, 4 in. apart and 
8 in. above the table top. On the table below 
each reinforcing light was a telegraph key. 

The orders of presentation and the durations 
of the signal lights and reinforcing lights were
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controlled by a modified Esterline-Angus re- 
corder using a punched tape and a system of 
electrical pick-up brushes. The recorder was 
placed on the table behind the signal board. 
Recorder pens which were activated by depres- 
sion of the telegraph keys in Ss’ booths were 
mounted between the brushes. Thus, the pre- 
sentations of the lights and Ss’ responses were 
recorded on the same tape. A panel light was 
mounted above the Esterline-Angus recorder so 
that E, seated behind the signal board, could 
watch the operation of brushes and pens during 
the experiment. 

Windows in the experimental room were cov- 
ered with opaque material and the experiment 
was run in darkness except for light that came 
from the apparatus. 

Subjects.— The Ss were 48 students obtained 
from beginning lecture courses in psychology 
during the fall semester of 1952 and assigned at 
random to experimental groups. 

Procedure.—At the beginning of a session, Ss 
were brought into the room, asked to be seated, 
and read the following instructions: 

“Be sure you are seated comfortably; it will 
be necessary to keep one hand resting lightly 
beside each of the telegraph keys throughout the 
experiment and to watch both the large board in 
the front of the room and the two small lights in 
your own compartment. Your task in this ex-
periment will be to outguess the experimenter on
each trial, or at least as often as you can. The 
ready signal on each trial will be a flash from the 
two top lights on the big board. About a second 
later either the left or the right lamp in your 
compartment will light for a moment. As soon 
as the ready signal flashes you are to guess 
whether the left or the right lamp will light on 
that trial and indicate your choice by pressing 
the proper key. If you expect the left lamp to
light, press the left key; if you expect the right 
lamp to light, press the right key; if you are not 
sure, guess. Be sure to make your choice as soon 
as the ready signal appears, press the proper key 
down firmly, then release the key before the 
ready signal goes off. It is important that you 
press either the left or the right key, never both,
on each trial, and that you make your decision
and indicate your choice while the signal light
is on.

“Now we will give you four practice trials.” 

At this point the overhead lights were extin- 
guished and the recorder started. If any obvi- 
ous mistakes were made by S during the four 
practice trials, they were pointed out by E. 
During the four practice trials the reinforcing 
lights were always given in the order: E1, E1,
E2, E2. After the practice trials the following

instructions were read: 
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“Are you sure you understand all of the in- 
structions so far? The rest of the trials will 
have to be run off without any conversation or 
other interruptions. Please make a choice on 
every trial even if it seems difficult. Make a
guess on the first trial, then try to improve your 
guesses as you go along and make as many cor-
rect choices as possible.”

Questions were answered by rereading or 
paraphrasing the appropriate part of the instruc- 
tions. If there were any questions about tricks 
the following additional paragraph was read.

“We have told you everything that will 
happen. There are no tricks or catches in this 
experiment. We simply want to see how well 
you can profit from experience in a rather diffi-
cult problem-solving situation while working
under time pressure.” 

The recorder was now started again and the 
240 experimental trials were run off in a con- 
tinuous sequence with no break or other indica- 
tion to S at the transition from Series A to 
Series B. On each trial, the signal lamps were 
lighted for approximately 2 sec.; 1 sec. later the 
appropriate reinforcing light in each S’s booth
lighted for .8 sec.; then after an interval of .4 
sec. the next ready signal appeared; and so on. 
The high rate of stimulus presentation was used 


in order to minimize verbalization on the part of 
Ss. 


RESULTS AND DISCUSSION

Terminal response probabilities.—It
will be clear from our discussion of
Equation 4 that the predicted asymp-
tote for each series will be the value
of π obtaining during the series. We
have taken the mean proportion of A1
responses during the last 40 trials of 
each series as an estimate of terminal 
response probability, and these values 
are summarized for all groups and 
both series in Table 2. 


TABLE 2 


TERMINAL MEAN RESPONSE PROBABILITIES 
FOR EACH SERIES 


         Series A                  Series B
Group    p̄      π      t          p̄      π      t
I        .37    .30    2.35       .28    .30    [illegible]
II       .48    .50    0.55       .37    .30    [illegible]
III      .87    .85    0.55       .30    .30    [illegible]

F        69.31                    2.98
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For the first series a simple analysis 
of variance yields an F significant 
beyond the .001 level for differences 
among means. From the within- 
groups variance estimate we obtain a 
value for the standard error of a group 
mean, and this is used in the t test 
between each group mean and the 
appropriate theoretical mean. For 
the second series the between-groups 
F has a probability between the .05 
and .10 levels. In neither series were 
differences among subgroup means 
significant at the .05 level. 

Fig. 1. Empirical and theoretical curves representing mean proportion
of E1 predictions (A1 responses) per 20-trial block for each series

The interpretation seems straightforward. Group III approximates
the theoretical asymptote in both series. Group I falls significantly
short of the theoretical asymptote in the first series but approximates
it in the second series. Group II falls significantly short of the
theoretical asymptote in the second series, but reaches the same
probability level as had Group I in the first series. Of the t tests
computed for differences between the last two blocks of 20 trials
in each series, all yielded probabilities greater than .10 except the
t for Series IIB which was significant at the .02 level. Evidently the
predictions concerning mean asymptotic values are correct, but the
rate of approach to asymptote is faster with Group III than under
the other conditions.

According to theory, not only group means, but also individual
curves should approach π asymptotically. To obtain evidence as to
the tenability of this aspect of the theory we have examined the
distributions of individual A1 response proportions for the last 40
trials of Series IIIA, IIIB, and IB. If all individual p values
approximate the theoretical asymptotes over these trials, then for
each of the series the individual response proportions should cluster
around the mean value, π, with an approximately binomial
distribution. Taking the theoretical σ equal to √[π(1 − π)/40],
which is actually a slight underestimate of the true value, we find
that approximately half of the scores in each series fall within
one σ of the theoretical asymptote and only one score in each series
deviates by more than three σ. It appears, then, that except for a
few widely deviant cases the p values of individual Ss
approach the theoretical asymptote.
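
As a rough numerical illustration of the binomial yardstick just described, the following Python sketch evaluates σ = √[π(1 − π)/40] for the three reinforcement probabilities used in the experiment. The function name `binomial_sigma` is hypothetical, and the sketch assumes that the scores are response proportions over 40 trials rather than counts.

```python
import math


def binomial_sigma(pi, n_trials=40):
    """Standard deviation of a response proportion over n_trials
    independent trials, each with A1 probability pi (binomial model)."""
    return math.sqrt(pi * (1.0 - pi) / n_trials)


if __name__ == "__main__":
    # Reinforcement probabilities used in the experiment.
    for pi in (0.30, 0.50, 0.85):
        sigma = binomial_sigma(pi)
        print(f"pi = {pi:.2f}: one-sigma band {pi - sigma:.3f} to {pi + sigma:.3f}")
```

Because real Ss need not share exactly the same p value, the spread of observed proportions can exceed this figure, which agrees with the remark that the binomial σ slightly underestimates the true value.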

One might raise a question as to just 
what is meant by the asymptote of an 
empirical curve in a situation of this 
kind. Naturally one would not expect 
the Ss to perform at constant rates 
indefinitely. It does not seem that 
any sort of breaking point was ap- 
proached in the present study, how- 
ever; one subgroup of Group I was 
run for an additional 60 trials beyond 
Trial 240 and maintained an average 
proportion of .304 A1 responses over
these trials. 

Mean learning curves.—In Fig. 1 mean data are plotted in terms of
the proportion of A1 responses per block of 20 trials. The theoretical
function which should describe these empirical curves is readily
obtained from Equation 4. Letting m be the ordinal number of a block
of 20 trials running from Trial n + 1 to Trial n + 20 inclusive,
and P̄(m) the expected proportion of A1 responses in the block, we
can write

    \bar{P}(m) = \pi - \frac{[\pi - \bar{p}(0)]}{20\theta} (1-\theta)^{20(m-1)} [1 - (1-\theta)^{20}]    (5)

this expression being simply the mean value of p(n) over the mth
block of 20 trials. According to theory, Equation 5 should describe
each of the mean curves of Fig. 1 once numerical values are
substituted for the parameters π, p̄(0), and θ; furthermore, the
value of θ required should not differ among groups within either
series and should be constant from series to series for each group.
The values of π are of course fixed by the experimental procedure.
The values of p̄(0) in the first series should be in the neighborhood
of .50, but for groups of size 16 sampling deviations could be quite
large so it will be best to get rid of p̄(0) in favor of P̄(1) which
can be measured more accurately. To do this we write Equation 5
for m = 1

    \bar{P}(1) = \pi - \frac{[\pi - \bar{p}(0)]}{20\theta} [1 - (1-\theta)^{20}]

then solve for [π − p̄(0)]

    [\pi - \bar{p}(0)] = \frac{20\theta [\pi - \bar{P}(1)]}{1 - (1-\theta)^{20}}

and substitute this result into Equation 5 giving

    \bar{P}(m) = \pi - [\pi - \bar{P}(1)] (1-\theta)^{20(m-1)}    (6)

Observed values of P̄(1) turn out to be .58 and .59 for Series IA
and IIIA, respectively. Now we lack only empirical estimates of θ
and these can be obtained by a simple statistical procedure. The
method we have used is to sum Equation 6 over all values of m,
obtaining for K blocks of trials

    \sum_{m=1}^{K} \bar{P}(m) = \sum_{m=1}^{K} \{ \pi - [\pi - \bar{P}(1)] (1-\theta)^{20(m-1)} \}
                              = K\pi - [\pi - \bar{P}(1)] \frac{1 - (1-\theta)^{20K}}{1 - (1-\theta)^{20}}    (7)

then equate Equation 7 to the sum of the observed proportions for a
given series and solve for θ. For Group I we obtain the estimate
θ = .018 and for IIIA, θ = .08. Using these parameter values we have
computed the theoretical curves for Group I and for the first series
of Group III, which may be seen in Fig. 1. In this analysis we find
agreement between data and theory in one respect but not in another.
The theoretical curves provide reasonably good descriptions of the
observed points, especially in the case of Group I, but the θ values
for the two groups are by no means equal. The latter finding does
not come as a surprise inasmuch as we had found in the previous
section that Group I was significantly short of its theoretical
asymptote in the first series, while Group III was not.
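
The estimation step, equating Equation 7 to the observed sum of block proportions and solving for θ, has no closed-form solution, so some numerical method is needed. The paper does not say which method was used; the Python sketch below uses bisection purely as an assumption, with hypothetical function names. It relies on the fact that the right-hand side of Equation 7 is monotone in θ for fixed π, P̄(1), and K.

```python
def eq7_sum(theta, pi, p_bar_1, n_blocks, block=20):
    """Right-hand side of Equation 7: predicted sum of the block means."""
    decay = (1.0 - theta) ** block
    return n_blocks * pi - (pi - p_bar_1) * (1.0 - decay ** n_blocks) / (1.0 - decay)


def estimate_theta(observed_sum, pi, p_bar_1, n_blocks, block=20, tol=1e-10):
    """Solve eq7_sum(theta) = observed_sum for theta in (0, 1) by bisection."""
    lo, hi = 1e-9, 1.0 - 1e-9
    f_lo = eq7_sum(lo, pi, p_bar_1, n_blocks, block) - observed_sum
    f_hi = eq7_sum(hi, pi, p_bar_1, n_blocks, block) - observed_sum
    if f_lo * f_hi > 0:
        raise ValueError("observed sum is outside the range the model can produce")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = eq7_sum(mid, pi, p_bar_1, n_blocks, block) - observed_sum
        if f_lo * f_mid <= 0:
            hi = mid          # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Synthetic check: build a sum from a known theta, then recover it.
    synthetic_sum = eq7_sum(0.018, pi=0.30, p_bar_1=0.58, n_blocks=6)
    print(estimate_theta(synthetic_sum, pi=0.30, p_bar_1=0.58, n_blocks=6))
```

The check uses a synthetic sum generated from θ = .018 rather than the observed proportions, which are not tabulated in the text.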

We did not try to estimate a θ value for the first series of Group II
since the empirical curve is virtually horizontal and closely
approximates the line P̄(m) = π = .50. We could proceed to estimate
θ values for Series IIB and IIIB by the method used above, but it
will be of more interest to construct predicted curves for these
series without using any additional information from the data.
According to the theory, it should be possible to compute those
curves from information already at our disposal. The p̄(0) values in
the second series should be the theoretical asymptotes of the first
series, or .50 and .85 for Groups II and III, respectively. The only
procedural difference between IA and IIB lies in the number of
preceding reinforcements; according to the statistical model,
however, this variable will be expected to have no effect except
insofar as it leads to a change in p̄(0), so except for sampling
error the θ value estimated for Group I should be applicable to IIB.
Using .50, .30, and .018 as the values of p̄(0), π, and θ,
respectively, we have computed a theoretical curve for Series IIB,
and this is plotted in Fig. 1. Similarly, the θ value estimated for
Series IIIA should apply also to IIIB, and we have used this value,
.08, together with .30 for π and .85 for p̄(0) to compute the
predicted curve for IIIB shown in Fig. 1. Considering that no degrees
of freedom in the Series B data have been utilized in curve fitting,
the correspondence between the theoretical and empirical curves does
not seem bad.

TABLE 3

PREDICTED AND OBSERVED MEAN FREQUENCIES OF THE A1 RESPONSE
IN THE SECOND SERIES

Group    Observed    Predicted    t
I 37.19 1 0%
7 79 42:86 004

F 3.16 (p > .05)

The reason for some of the irregulari- 
ties will be brought out in the next 
section. A statistical test of one 
aspect of the correspondence can be 
obtained by calculating for each theo- 
retical curve a predicted mean total 
of A1 responses in the second series,
by means of Equation 5, and com- 
paring these values with the observed 
mean totals. This has been done and 
the comparison is given in Table 3. 
The t values for differences between 
observed and theoretical values seem 
satisfactorily low. 

In order to give an idea of the extent 
to which the behavior of individual Ss 
conforms to the theoretical function, 
we have plotted in Fig. 2 the indi- 
vidual cumulative response curves for 
all Ss of Group I. The cumulative 
form was chosen for the smoothing 
effect, some of the noncumulative 
curves being too irregular for curve-
fitting purposes. The theoretical
curves in Fig. 2 represent Equation 7
with θ values obtained by a method of
approximation. Ten of the curves are
fitted quite well by this function with
π = .30 as the asymptote parameter.
Four curves, Numbers 2, 11, 15, 16, 
require other values for this param- 
eter, viz., .075, .45, .24, and .18, respec- 
tively. Curves 3 and 4 deviate con- 
siderably from the theoretical form. 
In general, it appears that the empir- 
ical curves for most individual Ss can 
be described quite satisfactorily by 
the theoretical function, and this fact 
gives us some basis for inferring that 
in this situation mean learning curves 
for groups of Ss reflect the trend of 
individual learning uncomplicated by 
any gross artifacts of averaging. 

The effect of 120 reinforcements at a
π value of .50 may be evaluated by
comparing curve forms and mean A1
response totals for Series IA and IIB.
We find that the reinforcements lead
to no increase in resistance to change.
Slopes of the two curves are very 
similar and the response totals do not 
differ significantly. This result is in 
line with predictions from the statis- 
tical model, but a little surprising, 
perhaps, from the viewpoint of Thorn- 
dikian or Hullian reinforcement the- 
ory since partial reinforcement has 
generally (6) been held to increase 
resistance to extinction in this 
situation. 

The conclusions from our study of 
the mean learning curves would seem 
to be (a) that under some circum- 
stances at least it is possible to evalu- 
ate theoretical parameters from the 
data of one series of learning trials and 
then to predict the course of learning 
in a new series; and (b) that the rate 
at which the mean learning curve 
approaches its asymptote depends, in 
an as yet incompletely specified man- 
ner, upon the difference between initial 
response probability and the proba- 
bility of reinforcement obtaining dur- 
ing the series. 

Sequence effects.—The mean curves studied in the preceding section
may not reflect adequately all of the learning that went on during
the experiment. The irregularities in some of the mean curves of
Fig. 1 might be accounted for if there is a significant tendency for
Ss’ response sequences to follow the vagaries of the sequences of
E1’s and E2’s. To check on this possibility we have plotted in Fig. 3
the mean proportions of A1 responses vs. frequencies of E1
occurrences per 10-trial block for all groups in Series B. In
preparing this graph, the 120 trials of Series B were divided into 12
successive blocks of 10. Since there were 48 Ss, there were 576 of
these trial blocks and they were classified according to the number
of E1 occurrences in a block. Then for the set of all blocks in which
no E1’s occurred, the mean proportion of A1 responses was computed
and entered as the first point in Fig. 3, and so on for the remaining
points. It seems clear that Ss were responding to the particular
sequences of E1’s and E2’s, and not simply to the over-all rate.
Corresponding graphs for the three groups in the first series had
somewhat shallower slopes; they have not been reported since some of
the individual points were based on too few cases to be reliable and
the groups could not be averaged together in the first series owing
to the different π values.

Fig. 3. Mean proportion of E1 predictions (A1 responses) in a block
of ten trials plotted against the actual number of E1 occurrences in
the block; data averaged for all groups in Series B

TABLE 4

MEAN DIFFERENCES BETWEEN OBSERVED VALUES OF p̄_{A1E1} AND p̄_{A1E2}
FOR EACH SERIES

Series    Group I    Group II    Group III
A         .128       42          .153
B         .214       .294        .231

In order to deal statistically with this apparent dependence of
response tendency upon the density of E1 occurrences in the
immediately preceding sequence, we have computed for each series the
average probability, p̄_{A1E1}, that an A1 occurs on Trial n given
that an E1 occurs on Trial n − 1 and the average probability,
p̄_{A1E2}, that an A1 occurs on Trial n given that an E2 occurs on
Trial n − 1. The difference between these two quantities can be
shown to be proportional to the point correlation (φ) between A(n)
and E(n − 1) for a given series. Furthermore our Equations 1 and 2
may be regarded as theoretical expressions for the two conditional
probabilities, p_{A1E1} and p_{A1E2} respectively, and it will be
seen that if these expressions are averaged over all values of n in
a series and the second subtracted from the first, the difference is
equal to the parameter θ, i.e.,

    (1-\theta)\bar{p}(n) + \theta - (1-\theta)\bar{p}(n) = \theta.

Thus from the statistical model we 
must predict that the difference be- 
tween empirical estimates of these 
conditional probabilities for any series 
should be positive and, if successive 
trials are independent, this difference 
should be equal to the value of θ esti-
mated from the mean response curve.
The conditional probabilities have 
been computed from the data for each 
S and mean differences by groups are 
summarized in Table 4. 
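
The prediction that the two conditional probabilities differ by θ when successive trials are independent can be illustrated by simulation. The Python sketch below is not the authors' procedure: it simulates a single idealized S whose A1 probability is raised toward 1 after an E1 and lowered toward 0 after an E2, each by the fraction θ, with E1 occurring at random with probability π. The function name, trial count, and seed are all assumptions chosen for illustration.

```python
import random


def simulate_difference(theta, pi, p0=0.5, n_trials=200_000, seed=0):
    """Estimate P(A1 on trial n | E1 on n-1) - P(A1 on trial n | E2 on n-1)
    for a simulated subject obeying the linear operators with rate theta."""
    rng = random.Random(seed)
    p = p0
    prev_event = None                # reinforcing event on the previous trial
    counts = {1: [0, 0], 2: [0, 0]}  # event -> [A1 responses, trials]
    for _ in range(n_trials):
        a1 = rng.random() < p        # response on this trial
        if prev_event is not None:
            counts[prev_event][0] += int(a1)
            counts[prev_event][1] += 1
        event = 1 if rng.random() < pi else 2
        # E1 moves p toward 1, E2 moves p toward 0.
        p = (1.0 - theta) * p + (theta if event == 1 else 0.0)
        prev_event = event
    p_a1_e1 = counts[1][0] / counts[1][1]
    p_a1_e2 = counts[2][0] / counts[2][1]
    return p_a1_e1 - p_a1_e2


if __name__ == "__main__":
    print(simulate_difference(theta=0.08, pi=0.30))
```

With many simulated trials the printed difference lies close to the θ supplied, so an observed difference well above the θ estimated from the mean curve points to some departure from the independence assumed here.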

All of the differences are positive 
and significant at better than the .001 
level of confidence. The differences 
among group means are insignificant 
for both series (F’s equal to .45 and 
al; respectively) as are differences 
among subgroup means. The increases from the first series to the
second are, however, significant beyond the .005 level. The latter
effect was not anticipated on theoretical grounds; the most plausible
explanation that has occurred to us is that alternation tendencies
associated with previously established guessing habits extinguished
during the early part of the experiment. This hypothesis would also
account for the high P̄(1) value observed for Group I in Fig. 1.

Although all of the quantities in Table 4 are positive and apparently
independent of π, as required by the
theory, the numerical values are all 
larger than the θ estimates obtained
from mean response curves. The most 
straightforward interpretation of this 
disparity would be that, owing to the 
short intertrial interval, successive 
trials are not independent in the sense 
required by the theoretical model. 
Nonindependence would have at least 
two immediate consequences in so far 
as the present experiment is concerned. 
First, stimulus samples drawn on suc- 
cessive trials would overlap, and the 
learning that occurred on one trial 
would affect behavior on the next to a 
greater extent than random sampling 
would allow for, thus increasing p̄_{A1E1}
and decreasing p̄_{A1E2}. Second, the re-
inforcing stimulus of one trial, E1 or
E2, would be part of the stimulus com-
plex effective at the beginning of the 
next trial. If this interpretation is 
correct, then more widely spaced trials 
should result in better agreement be- 
tween the alternative estimates of θ
and also in reduction of the depend- 
ence of mean learning rate upon prob- 
ability of reinforcement. 


SUMMARY 


Learning rates, asymptotic behavior, and 
sequential properties of response in a verbal con- 
ditioning situation were studied in relation to 
predictions from statistical learning theory. 

Forty-eight college students were run in an 
individualized modification of the “verbal condi- 
tioning” experiment originated by Humphreys 
(4). Each trial consisted in presentation of a 
signal followed by a left-hand or right-hand 
“reinforcing” light; S operated an appropriate 
key to indicate his prediction as to which light 
would appear on each trial. For each S one of 
the lights, selected randomly, was designated as
E1, the other as E2. On the first series of 120
trials, E1 occurred with probability .30, .50, and
.85 for Groups I, II, and III, respectively. On
the second 120 trials, E1 occurred with proba-
bility .30 for all groups.

Theoretical predictions were that mean proba-
bility of predicting E1 should tend asymptoti-
cally to the actual probability of E1, both during
original learning and following a shift in proba-
bility of reinforcement; and that response
probabilities should change in accordance with 
exponential functions, learning rates (as meas- 
ured by slope parameters) being independent of 
both initial condition and probability of rein- 
forcement. 

The statistical criterion for approach to theo- 
retical asymptote was met by Group I by the 
end of the second series and by Group III in 
both first and second series. In the second 
series, Group II was short of theoretical asymp- 
tote but reached the same response probability 
as had Group I during the first series. 

Learning rates were virtually identical for 
Group I, first series, and Group II, second series,
indicating that resistance of response probability 
to change is not altered by 50% random rein- 
forcement in this situation. Learning rates dif- 
fered significantly among groups within both 
series. In general, learning rate was directly 
related to difference between initial response 
probability and probability of reinforcement
during a series. It was suggested that this rela- 
tionship may depend upon temporal massing of 
trials. Not only group means, but individual 
learning curves could be described satisfactorily 
by theoretical functions. 

No tendency was observed for Ss to respond 
to a series as a whole. On the contrary, sensi- 
tivity to effects of individual reinforcements and 
nonreinforcements (E1 and E2 occurrences) in-
creased significantly as a function of trials.


AN INVESTIGATION OF SOME MATHEMATICAL
MODELS FOR LEARNING¹


CURT F. FEY²


University of Pennsylvania 


In this study an attempt is made 
to determine whether the results of 
two different learning experiments 
can be described by stochastic models 
proposed by Bush and Mosteller 
(1955) and Luce (1959) without 
changing the model parameters. 

The merit of a model lies in its 
ability to describe and predict data 
successfully with the aid of a mini- 
mum of free parameters. For any 
one experiment this can be done with 
several models. Consequently a more 
stringent test of a model is its ability 
to predict the fine structure of the 
data with one invariant set of param- 
eters in such a way that once the 
values of the parameters are deter- 
mined in one experiment these same 
parameters can be used to predict 
the outcome of another experiment. 

Galanter and Bush (1959) previ- 
ously studied parameter invariance 
in the linear model of Bush and 
Mosteller (1955). Their analysis 
showed an apparent lack of parameter 
invariance in a T-maze situation, but 
it is not clear whether the lack of 
parameter invariance was attributable 
to a basic mechanism in the model 
or was a consequence of sampling 
errors and difficulties in estimating 
parameters. 


The purpose of the present study is to investigate this question of
parameter invariance in greater detail. The experimental design was
improved over that of Galanter and Bush (1959) by running only one
rather than three trials per day, and it was extended to provide a
comparison between 100% reinforcement and 75% random reinforcement.

1 Based on the author's PhD dissertation supervised by R. R. Bush,
and read by R. D. [...] and J. Beck. The data analysis was [...]
Center of the [...] with the assistance of S. Gorn and P. Z.
Ingerman. This article appeared in J. exp. Psychol., 1961, 61,
455-461. Reprinted with permission.
2 Now at General Dynamics, Rochester, New York.


THE MODELS


The two models used in this paper may be designated as the alpha
model (Bush & Mosteller, 1955) and the beta model (Luce, 1959). Each
of them uses linear transformations. In the alpha model the linear
transformation is applied to the response probability p; in the beta
model it is applied to the quantity p/(1 − p).

Both of these models are stochastic, i.e., they deal with
probabilities of making responses. The models are path-independent:
the response probability on a given trial depends only on the
response probability and the outcome on the previous trial.

An animal in a T maze can turn either to the left or to the right on
any given trial. The models state that if on one trial S makes a
response for which it gets rewarded, then the probability of making
that same response on the next trial increases. The models specify
the manner of these changes.

Let p_n be the probability of going to the right-hand side
(probability of an "error") on trial n; let q_n = 1 − p_n; let α1,
α2, β1, and β2 be nonnegative parameters such that α1 and β1 are
associated with reward and α2 and β2 are associated with nonreward.
The models can then be defined in the following way:

Alpha Model          Response      Outcome      Beta Model
p_{n+1} = α1 p_n     left turn     reward       p_{n+1} = β1 p_n / (q_n + β1 p_n)
p_{n+1} = α2 p_n     right turn    nonreward    p_{n+1} = β2 p_n / (q_n + β2 p_n)
q_{n+1} = α1 q_n     right turn    reward       p_{n+1} = p_n / (p_n + β1 q_n)
q_{n+1} = α2 q_n     left turn     nonreward    p_{n+1} = p_n / (p_n + β2 q_n)

Mathematical properties of the alpha model were listed by Galanter
and Bush (1959, pp. 271-273) for the special condition that a left
response is always rewarded and a right-hand response is never
rewarded (100:0). For the beta model the mathematical properties have
been determined by Kanal (1960, 1961), Bush, Galanter, and Luce
(1959, p. 387), and Bush (1960).


METHOD

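
To make the two sets of operators concrete, here is a minimal Python sketch of one trial's update under each model. It assumes, following the verbal description above, that the alpha model multiplies p_n (or q_n) by α, while the beta model multiplies the odds p_n/q_n (or the reciprocal odds q_n/p_n) by β. The function names and the 'L'/'R' coding of responses are hypothetical conveniences.

```python
def alpha_update(p, response, rewarded, a1, a2):
    """One trial of the alpha model. p is the probability of a right turn
    (an "error"); response is 'L' or 'R'; a1 goes with reward, a2 with nonreward."""
    a = a1 if rewarded else a2
    if (response == "L") == rewarded:
        # left turn rewarded, or right turn not rewarded: p shrinks
        return a * p
    # right turn rewarded, or left turn not rewarded: q = 1 - p shrinks
    return 1.0 - a * (1.0 - p)


def beta_update(p, response, rewarded, b1, b2):
    """One trial of the beta model, acting on the odds p / (1 - p)."""
    b = b1 if rewarded else b2
    q = 1.0 - p
    if (response == "L") == rewarded:
        # odds p/q are multiplied by b
        return b * p / (q + b * p)
    # reciprocal odds q/p are multiplied by b
    return p / (p + b * q)


if __name__ == "__main__":
    # Parameter values quoted in the table captions of this paper.
    print(alpha_update(0.5, "L", True, a1=0.858, a2=0.955))
    print(beta_update(0.5, "L", True, b1=0.952, b2=0.642))
```

One consequence visible in the sketch is that the beta operators leave p unchanged when p = 1, whereas the alpha operators do not, so a beta-model animal must start from an initial probability below 1 in order to learn.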


Subjects.—The Ss were male hooded rats 
of the Long-Evans strain, from Rockland 
Farms, New York City, New York. They 
weighed about 75 gm. on arrival. Eight rats
were used for the preliminary experiment. In 
the main experiment 63 rats were used, but 
the final N = 50, because 13 died during the 
experiment. 


2 For details see Fey, 1960. 


TABLE 1

PERIOD 2, 100:0 GROUP

COMPARISON OF STATISTICS FROM THE FIRST 35 TRIALS OF EXPERIMENTAL
GROUP WITH CORRESPONDING MODEL VALUES CALCULATED WITH p1 = 1,
α1 = .858, AND α2 = .955 FOR ALPHA MODEL AND p1 = .97, β1 = .952,
AND β2 = .642 FOR BETA MODEL


Means Standard Errors 
Statistic 
Exp. a Model | B Model Exp. a Model | B Model 
Number of Ss 25 100 500 
Number of trials 35 35 Et 
Total number of errors 12.28 12.28 12.39 .76 .00 012 
Trial of last error 23.16 | 22.10 | 2645 | 100 125 ‘028 
Trial of first success 6.88 6.87 6.49 48 .00 024 
Number of RR sequences 7.48 7.32 7.01 .56 .00 014 
RL 4.76 4:85 5:32 28 .09 -016 
LR 380 3.85 4.41 28 10 2 
96 17.98 | 17:26 চি 22 : 
NE L runs of: i 2 
ength 1 2.00 1.81 1.86 .20 10 0! 
2 56 81 87 08 03 014 
3 44 51 87 ‘08 03 -006 
4 2 37 38 ‘04 04 Se 
ট 28 ly ১ 
Number of R runs of: oS 
Length 1 2.60 2.70 3:12 24 .00 014 
2 80 81 83 16 00 .008 
3 40 40 36 08 .00 ‘006 
4 24 2 28 ‘04 ‘00 .008 
5 .20 .20 .20 04 .00 004 
Total number of R runs 4:80 4:96 5:38 28 00 018 


Note.—Standard error of the mean was computed from range approximation.
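
The statistics compared in this table are all simple functions of one animal's sequence of left and right turns. The Python sketch below shows one way they might be computed; it assumes a sequence coded as a string of 'L' (success) and 'R' (error) characters, one per trial, and the function name and output keys are hypothetical.

```python
from itertools import groupby


def sequence_statistics(responses):
    """Summary statistics for one animal's sequence of 'L' and 'R' turns.
    Trials are numbered from 1; 'R' counts as an error, 'L' as a success."""
    if set(responses) - {"L", "R"}:
        raise ValueError("responses may contain only 'L' and 'R'")
    error_trials = [i + 1 for i, r in enumerate(responses) if r == "R"]
    success_trials = [i + 1 for i, r in enumerate(responses) if r == "L"]
    # Counts of adjacent response pairs (RR, RL, LR, LL sequences).
    pairs = {"RR": 0, "RL": 0, "LR": 0, "LL": 0}
    for first, second in zip(responses, responses[1:]):
        pairs[first + second] += 1
    # Number of runs of each length, separately for R and L.
    runs = {"R": {}, "L": {}}
    for symbol, group in groupby(responses):
        length = len(list(group))
        runs[symbol][length] = runs[symbol].get(length, 0) + 1
    return {
        "total_errors": len(error_trials),
        "trial_of_last_error": error_trials[-1] if error_trials else None,
        "trial_of_first_success": success_trials[0] if success_trials else None,
        "pairs": pairs,
        "r_runs": runs["R"],
        "l_runs": runs["L"],
        "total_r_runs": sum(runs["R"].values()),
    }


if __name__ == "__main__":
    print(sequence_statistics("RRLRLLL"))
```

Averaging each statistic over animals, real or simulated, gives entries of the kind tabulated above.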
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TABLE 2 


PERIOD 2, 75:25 GROUP 


COMPARISON OF STATISTICS OF 75:25 EXPERIMENTAL GROUP WITH CORRESPONDING
MODEL VALUES CALCULATED WITH p1 = 1, α1 = .858, AND α2 = .955 FOR
ALPHA MODEL AND p1 = .91, β1 = .952, AND β2 = .642
FOR BETA MODEL


Standard Errors 


Means 
Statistic 
Exp. a Model B Model 2 Exp. a Model B Model 
Number of animals 25 100 200 
Total number of errors 15.92 | 19.21 | 1943 | 104 | 22 | 03 
Trial of last error 27.32 32.71 33.59 1.04 Ps! .045 
Trial of first success en “e Re 8 I 9 
equences , . ' . 2) . 
NEUROL ROS S44 | L059 E33 28 | do | 035 
IR 1:68 637 | 675 | 32 ‘09 ‘035 
LL 13:64 869.) S| 42 24 ‘08 
fr 
NUE as 2.48 3168. |. 3:55, 20 10 045 
2 1.16 162: L68:|- 20 ‘05 ‘03 
40 ‘77 | 109 | ‘08 ‘03 ‘02 
4 ‘24 4 £52. | 04 ‘02 ‘015 
5 ‘16 23 21 | ‘04 ‘02 01 
রব f : 
NUE wy Re ides 288° | 332:| “370 | 20 | 40 | 035 
2 1.16 150] Res se ‘05 ‘03 
5 48 ‘66 ‘79 12 ‘04 ‘015 
4 36 54 ‘47 08 ‘03 ‘02 
‘20 38 99 08 ‘02 ‘01 
Total number of R runs 5:68 736 | 7:62 32 ‘09 ‘035 


Note.—The model parameters were estimated from the 100:0 group. Standard error was computed from
range approximation.


mash to balance olfactory cues, and the top 
contained the reward pellet. 

Procedure.—This experiment consisted of 
three parts: (a) preliminary handling; (b)
straight alley pretraining; and (c) T-maze 
learning. 

The Ss were kept in the laboratory for 23 
days at ad lib. food and water and were 
handled daily. Then Ss were deprived of 
food for 18, 21, 214, 213, and 22 hr. on Days 
24, 25, 26, 27, and 28, respectively. 

The pretraining started on Day 29. For 
the remainder of the experiment Ss were under 
18 hr. food deprivation at the beginning of 
each daily run. They were fed 4 hr. later 
for a 2-hr. period. Water was always available 
in the cages. 

The Ss were given one trial per day of 
pretraining on the straight alley runway. 


Apparatus.—The T maze was a replica of that used by Galanter and
Bush (1959). It consisted of a straight alley runway for pretraining
and a T maze for the main experiment. The T maze was built in such a
way that the cross member and the start arm of the T could be
separated and a goalbox could be hooked to the stem of the T, thereby
changing the maze into a straight runway. The maze was built of
plywood with a removable wire mesh top and pressed wood doors. The
inside of the stem and the attachable goalbox were painted medium
gray, the right arm was light gray, and the left, dark gray. The
length of the cross arm was 60 in., the length of the stem was 26 in.,
and the attachable goalbox was 10 in. The alleys were 4 in. wide and
the walls were 8 in. high. The starting compartment was 10 in. long
with a guillotine door on the maze side and a hinged door on the
outside. Another guillotine door was at the choice point. The goal
cups were placed at the end of each arm. The metal goal cups had
double floors, the bottom part contained inaccessible wet food


Pretraining lasted for three days. 

During the 30 days of Period 1 of the T-maze 
learning, the following procedure was 
adhered to: a .038-gm. pellet was deposited 
in the right goal cup; nothing was placed in 
the left goal cup. The S was placed in the 
startbox and the startbox door was raised. 
As S passed the choice point, its door was 
lowered. The S was left in the maze until it 
ate the pellet, until it investigated the goal 
cup (on the nonrewarded side), or until 3 
min. were up, whichever occurred first. 

At the end of Period 1 Ss were divided at 
random into two groups. One group was 
always rewarded on the left side during Period 
2, and the other was rewarded according to the 
following schedule obtained from a random 
number table with P(L) = 0.75: LLL R 
LLRLERCLELLCLELRLELLER 
LLRRRRLLLLL. Period 2 lasted for 
35 days. 

Estimation of parameters.—The parameters 
of the alpha model were estimated in the 
following way: The initial probability $p_1$ was 
taken to be 1.00. The other two parameters 
were estimated from the Period 2 data of the 
100:0 group by equating the observed mean 
number of trials before the first success and 
the observed mean total number of errors 
to their respective expected values. 

Initial estimates of the beta model parameters 
were determined by methods similar 
to those used for finding the alpha model 
parameters. These estimates were modified 
by exploration of the parameter space until 
the response probabilities (Monte Carlo 
computations) were similar to the experimental 
data of the 100:0 group. The following 
criteria were used: the total number of 
errors generated by the model had to match 
the data, and a plot of trial-by-trial mean 
response probabilities produced by the model 
had to appear similar to the corresponding 
plot of the data. 
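
The Monte Carlo computations mentioned above can be illustrated with a minimal Python sketch. It is not the original procedure. It assumes a two-operator linear (alpha) model in which the error probability p is multiplied by alpha1 after an error and by alpha2 after a success, one trial per day for the 35 days of Period 2, and the 100:0 condition. The function names, the trial count, and the random seed are illustrative assumptions.

```python
import random


def simulate_stat_rat(p1, alpha1, alpha2, n_trials, rng):
    """Run one Monte Carlo analog ("stat rat") through n_trials trials.

    p is the probability of an error on the current trial (assumed form):
    after an error p is multiplied by alpha1, after a success by alpha2.
    Returns (total_errors, trials_before_first_success).
    """
    p = p1
    total_errors = 0
    trials_before_first_success = None
    for trial in range(n_trials):
        if rng.random() < p:              # error on this trial
            total_errors += 1
            p *= alpha1
        else:                             # success on this trial
            if trials_before_first_success is None:
                trials_before_first_success = trial
            p *= alpha2
    if trials_before_first_success is None:
        trials_before_first_success = n_trials
    return total_errors, trials_before_first_success


def mean_statistics(p1, alpha1, alpha2, n_trials=35, n_rats=100, seed=0):
    """Average the two statistics over n_rats simulated animals."""
    rng = random.Random(seed)
    runs = [simulate_stat_rat(p1, alpha1, alpha2, n_trials, rng)
            for _ in range(n_rats)]
    mean_errors = sum(r[0] for r in runs) / n_rats
    mean_first = sum(r[1] for r in runs) / n_rats
    return mean_errors, mean_first
```

A call such as `mean_statistics(1.00, 0.858, 0.955)` returns the two sample means for 100 simulated animals. Parameter estimation by equating observed and expected means would then amount to adjusting alpha1 and alpha2 until these values agree with the data.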


RESULTS 

The results of the experiment are 
summarized in Fig. 1 and 2 and Tables 
1 and 2. Figure 1 presents the 
proportions of R response of the 100:0 
group during Period 2 and the corresponding 
curves generated by the 
models. Figure 2 depicts the same 
data for the 75:25 group during 
Period 2. Tables 1 and 2 give 
comparative results of this experiment 
and corresponding model values. A 
more detailed analysis of results is 
presented by Fey (1960). 

FIG. 2. Period 2, Group 75:25. Trial by trial proportions of L responses made by 25 
experimental Ss (filled circles); by 100 alpha model Monte Carlo analogs (open circles) 
computed with $p_1 = 1.00$, $\alpha_1 = 0.858$, and $\alpha_2 = 0.955$; and by 200 beta model Monte Carlo analogs 
(triangles) computed with $p_1 = 0.97$, $\beta_1 = 0.952$, and $\beta_2 = 0.647$. (R = food reward is in 
right maze arm, otherwise the left arm is baited.) 


DISCUSSION 


The merit of a mathematical model 
of learning lies not so much in describing 
the data of any one experiment with the 
aid of parameters estimated from that 
particular experiment as in its ability 
to represent accurately the learning 
process of a variety of different experimental 
situations using the same set of 
parameters. In other words, once the 
parameters are estimated for one experimental 
situation the model should be 
able to predict the course of learning 
in other experiments. Models which 
will handle a variety of experimental 
situations with the same set of parameters 
are called parameter invariant. 


This experiment indicates that the 
models under consideration fit the Period 
2, 100:0 group data, from which their 
parameters were estimated, quite well, 
but the fit to the Period 2, 75:25 group 
data (using parameters estimated from 
the Period 2, 100:0 group) is less success- 
ful. Both models show an apparent lack 
of parameter invariance of approximately 
equal magnitude. 

Tables 1 and 2 might give the impres- 
sion that the alpha model fits the data 
slightly better than the beta model. 
This conclusion is hardly warranted if 
the magnitudes of the differences and the 
methods of estimating the parameters 
are considered. The alpha model param- 
eters were determined analytically; those 
of the beta model were estimated by 
Monte Carlo procedures. Thus the 
alpha model parameters were determined 
more exactly than those of the beta model. 

The lack of long runs seems to be a 
basic difficulty of the models. This is of 
little consequence in 100:0 animal learning, 
but it does seem to be important in 
partial reinforcement schedules for animals 
as well as in human choice behavior 
(Derks, 1960). This lack of long runs is 
not generally manifested in mean learning 
curves, but only in a sequential 
analysis of the data. 

The fact that the 75:25 "stat rats" 
learn more slowly than the experimental 
Ss is not as serious as the lack of long 
runs. A change in the size of the model 
parameters will correct the former de- 
ficiency. The data indicate that by 
reducing the beta model parameters 
by about 25%, the total number of 
errors made by the model analogs will 
match those of the experimental Ss 
for the 75:25 group. These reduced 
parameters decrease the fit to the 
100:0 group. 

The slow learning of the 75:25 model 
analogs could be handled by specifying 
the manner in which the parameters are 
modified when the schedule changes from 
100:0 to 75:25. With respect to the 
lack of perseverance, no small change in 
parameter values would increase the 
fit of model to data. 

Galanter and Bush (1959) noted in 
three of their experiments that the 
probability of turning to the more frequently 
rewarded side tended to decrease 
slightly during the first few acquisition 
trials before it began to rise. This 
phenomenon occurs also in other experimental 
situations (Gibson & Walk, 
1956; Jensen, 1960; Kendler & Lachman, 
1958). In the present experiment, 
the initial dip is hardly noticeable.³ 
A look at the time the rats spent in 
the baited and in the unbaited arms of 
the maze (Fig. 3 and 4) indicates that 
initially our Ss and those of the Galanter 
and Bush (1959) Exp. III were removed 
more quickly from the unbaited than 
from the baited side of the maze; later 
in the experiment, removal occurred 
after approximately the same time 
interval in either arm of the maze. 
The reason for this is found in the criteria 
for removing S from the maze: S is left 
in the maze until it investigates the food 
cup on the unbaited side, until it eats 
the pellet on the baited side, or until 
3 min. are up, whichever occurs first. 
The Ss investigate the food cup on the 
baited side before they start eating the 
pellet. In fact, S may take a pellet in its 
mouth, drop it, and not eat it; thus 
investigation of the food cup occurs 
before the eating (Fig. 3 and 4). 

FIG. 3. Trial by trial distribution of time 
spent in baited, left (filled circles) and unbaited, 
right (open circles) maze arm by 20 
Ss of Galanter and Bush (1959) Exp. II. 

³ The data plotted in Fig. 3 were obtained 
from the original protocols of the experiment 
reported by Galanter and Bush (1959). 

Should S initially prefer removal from 
the maze to eating the pellet, the non- 
rewarded side may actually be more 
attractive to S than the rewarded side, 
since S has to stay in the nonrewarded 
side for a shorter period of time. The Ss 
behave initially as if they were much 
more interested in escaping from the 
maze than in eating. As S becomes 
accustomed to the experimental situa- 
tion, interest in escaping decreases and 
the food pellet gradually becomes more 
attractive. 

This explanation can handle the dip 
in our experiment, but it fails in the 
case of other experiments such as a 
Skinner-box situation. 


SUMMARY 


This paper investigated two models for 
learning: a linear model proposed by Bush 
and Mosteller and a nonlinear model devel- 
oped by Luce. Specifically, an attempt was 
made to determine whether data obtained 
from two different experimental situations 
could be described by the models without 
changing the parameters. 

Two groups of rats were trained in a T 
maze. One group was always rewarded with 
food on one side; the other group received a 
food reward with probability .75 on one side 
and .25 on the other side. Model statistics 
were computed for both groups, using param- 
eters estimated from the group that was 
always rewarded on the same side, and 
compared with the experimental data. 

It was found that there is good agreement 
between the models and the data of the 
continuously reinforced group, from which 
the model parameters were estimated. The 
fit to the data of the partially reinforced 
group, however, leaves something to be 
desired. 

Both models fit the data about equally 
well. 




REFERENCES 


BUSH, R. R. Some properties of Luce's beta 
model for learning. In K. J. Arrow, S. 
Karlin, & P. Suppes (Eds.), Proceedings of 
the first Stanford symposium on mathematical 
methods in the social sciences. 
Stanford, Calif.: Stanford Univer. Press, 
1960. 

BUSH, R. R., GALANTER, E., & LUCE, R. D. 
Tests of the "beta model." In R. R. Bush 
& W. K. Estes (Eds.), Studies in mathematical 
learning theory. Stanford, Calif.: 
Stanford Univer. Press, 1959. 

BUSH, R. R., & MOSTELLER, F. Stochastic 
models for learning. New York: Wiley, 
1955. 

DERKS, P. Human binary prediction and 
the "conditioning axiom" under temporal, 
incentive, contingency, and experimental 
variations. Unpublished doctoral dissertation, 
University of Pennsylvania, 1960. 

FEY, C. F. Parameter invariance in models 
for learning. Unpublished doctoral dissertation, 
University of Pennsylvania, 1960. 

GALANTER, E., & BUSH, R. R. Some T-maze 
experiments. In R. R. Bush & W. K. 
Estes (Eds.), Studies in mathematical learning 
theory. Stanford, Calif.: Stanford 
Univer. Press, 1959. 

GIBSON, E. J., & WALK, R. D. The effect 
of prolonged exposure to visually presented 
patterns on learning to discriminate them. 
J. comp. physiol. Psychol., 1956, 49, 239-242. 

JENSEN, G. D. Learning and performance 
as functions of ration size, hours of privation, 
and effort requirement. J. exp. Psychol., 
1960, 59, 261-268. 

KANAL, L. Analysis of some stochastic 
processes arising from a learning model. 
Unpublished doctoral dissertation, University 
of Pennsylvania, 1960. 

KANAL, L. On a random walk related to a 
nonlinear learning model. IRE Nat. convention 
Rec., 1961, in press. 

KENDLER, H. H., & LACHMAN, R. Habit 
reversal as a function of schedule of reinforcement 
and drive strength. J. exp. 
Psychol., 1958, 55, 584-591. 

LUCE, R. D. Individual choice behavior: 
A theoretical analysis. New York: Wiley, 
1959. 


(Early publication received December 6, 1960) 


A FUNCTIONAL EQUATION ANALYSIS OF TWO LEARNING 
MODELS* 


LAVEEN KANAL† 
GENERAL DYNAMICS/ELECTRONICS 
ROCHESTER, NEW YORK 


One-absorbing barrier random walks arising from Luce's nonlinear 
beta model for learning and a linear commuting-operator model (called the 
alpha model) are considered. Functional equations for various statistics are 
derived from the branching processes defined by the two models. Solutions 
to general functional equations, satisfied by statistics of the alpha and beta 
models, are obtained. The methods presented have application to other 
learning models. 


The two-response, two-event, path-independent, contingent version of 
a number of stochastic models for learning is given by the equations 

(1)    $p_{n+1} = \begin{cases} Q_1 p_n & \text{with probability } p_n, \\ Q_2 p_n & \text{with probability } (1 - p_n), \end{cases}$

where $Q_1$ and $Q_2$ represent transition operators, and $p_n$ and $(1 - p_n)$ are, 
respectively, the probabilities of responses $A_1$ and $A_2$ on trial $n$. A linear 
model discussed by Bush and Mosteller [8] is obtained when the operators 
in (1) are defined by the equations 

(2)    $Q_1 p_n = \alpha_1 p_n \quad (0 < \alpha_1 \le 1), \qquad Q_2 p_n = \alpha_2 p_n \quad (0 < \alpha_2 \le 1).$

In this paper, this linear model is called the "alpha" model. A specialization 
of the nonlinear "beta" model proposed by Luce [13] is obtained when the 
operators are defined by the equations: 

(3)    $Q_i p_n = \dfrac{\beta_i p_n}{1 - p_n + \beta_i p_n}, \qquad \beta_i > 0;\ i = 1, 2.$

In terms of the variable $v_n = p_n/(1 - p_n)$ the transition equations for this 
version of the beta model are 

(4)    $v_{n+1} = \begin{cases} \beta_1 v_n & \text{with probability } p_n, \\ \beta_2 v_n & \text{with probability } (1 - p_n), \end{cases}$

where $\beta_i > 0$, $i = 1, 2$. 

*Abstracted from portions of the author's doctoral dissertation, University of Pennsylvania, 
June 1960. The author is indebted to Robert R. Bush, his dissertation supervisor, 
for the valuable help and encouragement received from him and to R. Duncan Luce for 
many helpful discussions and for partial support from an NSF grant. 

†Formerly at the Moore School of Electrical Engineering, University of Pennsylvania, 
Philadelphia, Pa. The author is grateful to J. G. Brainerd, S. Gorn, and C. N. Weygandt 
of the Moore School, and N. F. Finkelstein, D. Parkhill and A. A. Wolf of General Dynamics 
for their encouragement. 

This article appeared in Psychometrika, 1962, 27, 89-104. Reprinted with permission. 

In the beta model response probabilities undergo nonlinear rather than 
linear transformations from trial to trial. Since the probabilities of choice 
inevitably enter into the derivation of stochastic properties of the model, 
the methods generally used to derive properties of linear learning models 
do not apply to the beta model. 
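
The contrast between the two kinds of operator can be made concrete with a short Python sketch. It assumes the beta operator has the form p → βp/(1 − p + βp), which is the form under which the odds v = p/(1 − p) are simply multiplied by β. The function names are illustrative and not part of the original paper.

```python
def alpha_update(p, alpha):
    """Linear (alpha model) operator: p -> alpha * p."""
    return alpha * p


def beta_update(p, beta):
    """Nonlinear (beta model) operator, assumed form:
    p -> beta * p / (1 - p + beta * p)."""
    return beta * p / (1.0 - p + beta * p)


def odds(p):
    """v = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1.0 - p)


def check_odds_scaling(p, beta):
    """Return (odds after the beta operator, beta times the odds before).

    The two numbers agree up to rounding, which is the statement that the
    beta model is linear in v although it is nonlinear in p.
    """
    return odds(beta_update(p, beta)), beta * odds(p)
```

The linear operator scales p itself, whereas the beta operator scales the odds. This is why the analysis of the beta model is carried out in terms of v rather than p.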

Analytical methods applicable to both the alpha and beta models are 
presented in this paper. The approach used is to consider the branching process 
defined by the decision rules of the two models, and from it to formulate 
functional equations for various statistics of interest. Tatsuoka and Mostel- 
ler [15] used a functional equation approach to obtain some statistics for 
the alpha model. Their techniques differ somewhat from those presented 
here; the approach developed here leads to a unified method of attack for 
the alpha and beta models and can be extended to others. 


Some Random Walks Arising from the Beta Model 

In (4), $\beta_i > 1$ and $\beta_i < 1$ may be identified, respectively, with reward 
and nonreward of the response. If response $A_1$ is never rewarded and response 
$A_2$ is always rewarded $\beta_1 < 1$, $\beta_2 < 1$. If both responses are always rewarded 
$\beta_1 > 1$, $\beta_2 < 1$. If neither response is ever rewarded $\beta_1 < 1$, $\beta_2 > 1$. It is 
shown in [11] that these three cases lead to one-absorbing-barrier (OAB), 
two-absorbing-barrier (TAB), and two-reflecting-barrier (TRB) walks. Rigorous 
proof of the nature of the barriers for these and other random walks 
resulting from the two-alternative, two-outcome beta model is given by 
Lamperti and Suppes [12]. Only the OAB beta model ($\beta_1 < 1$, $\beta_2 < 1$) is 
considered in this paper. Except for the case when $\alpha_i = 1$, in the alpha 
model either response diminishes the probability of response $A_1$; the alpha 
model is a one-absorbing-barrier model. 


Functional Equations for Statistics of the 
One-Absorbing Barrier Models 


The OAB alpha and beta models lead to an asymptotic distribution of 
$p_n$ which has all its density at $p = 0$. (Considering response $A_1$ as an error 
on the part of organisms which are learning, this means that all organisms 
eventually learn not to make errors.) Additional information about the 
processes is obtained from various statistics. Following the work of Bush 
and Sternberg [9] on a simple single-operator model, the statistics considered 
are those which describe the rate of approach to the asymptote, such as the 
mean, weighted mean, and variance of the rate of approach; sequential 
statistics concerning runs of responses; other statistics, such as those describing 
the first occurrence of an $A_2$ response (success) and the last occurrence 
of an $A_1$ response (failure). Functional equations satisfied by these 
statistics are derived by considering the branching processes shown in Fig. 1. 

For the analysis which follows, a sequence $x_1, x_2, \cdots, x_n$ of random 
variables is defined such that 

$x_n = \begin{cases} 1 & \text{if response } A_1 \text{ occurs on trial } n, \\ 0 & \text{if response } A_2 \text{ occurs on trial } n. \end{cases}$

The random variables have expectations $p_n$. 


The mean number of $A_1$ responses 

In terms of the random variables $x_n$, the total number of $A_1$ responses 
in $N$ trials is given by the random variable $X_N = \sum_{n=1}^{N} x_n$ with expectation 
$E(X_N) = \sum_{n=1}^{N} p_n$. In the one-absorbing barrier models, both responses 
decrease the probability of response $A_1$ and 

$E(X) = \lim_{N \to \infty} E(X_N)$

is of interest. In fact, by replacing the parameters of the models by $\beta = 
\max(\beta_1, \beta_2)$ and $\alpha = \max(\alpha_1, \alpha_2)$ finite upper bounds for $E(X)$ in the two 
models are obtained. Now the number $X_N$ of $A_1$ responses in $N$ trials starting 
from trial 1 will be equal to the number, $X_{N-1}$, of $A_1$ responses in $(N - 1)$ 
trials starting from trial 2 if the result of trial 1 is an $A_2$ response and be 
equal to $1 + X_{N-1}$ if the result of trial 1 is an $A_1$ response. Letting $\phi$ denote 
the expected number of $A_1$ responses, the functional equations for $\phi$ are 
obtained from Fig. 1 to be 

$\phi_\beta(v, N) = p[1 + \phi_\beta(\beta_1 v, N - 1)] + (1 - p)\,\phi_\beta(\beta_2 v, N - 1)$
$\qquad\qquad = \dfrac{v}{1+v}\,[\phi_\beta(\beta_1 v, N - 1) + 1] + \dfrac{1}{1+v}\,\phi_\beta(\beta_2 v, N - 1),$

and 

$\phi_\alpha(p, N) = p[\phi_\alpha(\alpha_1 p, N - 1) + 1] + (1 - p)\,\phi_\alpha(\alpha_2 p, N - 1).$

When $N \to \infty$ these equations become 

(5)    $\phi_\beta(v) = \dfrac{v}{1+v}\,\phi_\beta(\beta_1 v) + \dfrac{1}{1+v}\,\phi_\beta(\beta_2 v) + \dfrac{v}{1+v},$

(6)    $\phi_\alpha(p) = p\,\phi_\alpha(\alpha_1 p) + (1 - p)\,\phi_\alpha(\alpha_2 p) + p.$

Both the above functions must, of course, satisfy the boundary condition 
$\phi(0) = 0$. 

The second moment of the number of $A_1$ responses 

Letting $\theta$ denote $E(X^2)$ the functional equations for the second moment 
are then, as $N \to \infty$, 

(7)    $\theta_\beta(v) = \dfrac{v}{1+v}\,\theta_\beta(\beta_1 v) + \dfrac{1}{1+v}\,\theta_\beta(\beta_2 v) + \dfrac{v}{1+v}\,[1 + 2\phi_\beta(\beta_1 v)],$

(8)    $\theta_\alpha(p) = p\,\theta_\alpha(\alpha_1 p) + (1 - p)\,\theta_\alpha(\alpha_2 p) + p[1 + 2\phi_\alpha(\alpha_1 p)].$

$\theta(0) = 0$ is a boundary condition. Finite upper bounds exist for $\theta_\beta(v)$ and 
$\theta_\alpha(p)$; replacing the parameters by $\beta = \max(\beta_1, \beta_2)$ and $\alpha = \max(\alpha_1, \alpha_2)$ 
the variance of $X_N$ is $\sum_{n=1}^{N} p_n(1 - p_n)$ which remains finite as $N \to \infty$ if 
$\sum_{n=1}^{N} p_n$ does. Functional equations for higher moments are easily obtained 
in this manner. 

The functional equations for the mean and second moment of the 
number of $A_1$ responses have been previously obtained by Tatsuoka [14] 
and Tatsuoka and Mosteller [15]. Their method of derivation is somewhat 
different from that presented here. 


The weighted number of $A_1$ responses 

Define the random variable 

$Y_{\omega,N} = \sum_{n=1}^{N} (n + \omega)\,x_n.$

Then $Y_{0,N}$ represents the weighted number of $A_1$ responses in $N$ trials with 
the weighting function being the trial number $n$. From trial 2 on, the weighted 
number of $A_1$ responses is $\sum_{n=2}^{N} n x_n$, which by relabeling the random variables 
$x_2, x_3, \cdots, x_N$ as $x_1, x_2, \cdots, x_{N-1}$ can be represented by the random variable 

$Y_{1,N-1} = \sum_{n=1}^{N-1} (n + 1)\,x_n = Y_{0,N-1} + X_{N-1}.$

If $\psi$ stands for the expectation of the weighted number of $A_1$ responses, 
the functional equations are obtained by noting that $Y_{0,N}$ is equal to $Y_{1,N-1}$ 
if the result of the first trial is an $A_2$ response and is equal to $(1 + Y_{1,N-1})$ 
if the result of the first trial is an $A_1$ response. For an infinite number of trials, 

(9)    $\psi_\beta(v) = \dfrac{v}{1+v}\,\psi_\beta(\beta_1 v) + \dfrac{1}{1+v}\,\psi_\beta(\beta_2 v) + \phi_\beta(v),$

(10)    $\psi_\alpha(p) = p\,\psi_\alpha(\alpha_1 p) + (1 - p)\,\psi_\alpha(\alpha_2 p) + \phi_\alpha(p).$

A boundary condition is $\psi(0) = 0$. 

Number of trials before the first $A_2$ response (success) occurs 

Let $F_1 + 1$ denote the trial number on which response $A_2$ occurs for 
the first time so that $F_1$ is the number of trials before the first $A_2$. $F_1$ is 
equal to zero if $A_2$ occurs on the first trial and is equal to $(1 + F_2)$, where 
$F_2$ denotes the number of trials, before the first $A_2$ response occurs, starting 
at trial 2, if trial 1 results in an $A_1$ response. Letting $\nu$ denote the expectation 
of the random variables $F$, the functional equations for $\nu$ are 

(11)    $\nu_\beta(v) = p_1[\nu_\beta(\beta_1 v) + 1] + [(1 - p_1) \cdot 0] = \dfrac{v}{1+v}\,\nu_\beta(\beta_1 v) + \dfrac{v}{1+v},$

(12)    $\nu_\alpha(p) = p\,\nu_\alpha(\alpha_1 p) + p.$

If $\rho$ denotes the second moment of the random variables $F$, the functional 
equations for $\rho$ are 

(13)    $\rho_\beta(v) = \dfrac{v}{1+v}\,[1 + 2\nu_\beta(\beta_1 v) + \rho_\beta(\beta_1 v)] = \dfrac{v}{1+v}\,\rho_\beta(\beta_1 v) + \dfrac{v}{1+v}\,[1 + 2\nu_\beta(\beta_1 v)],$

(14)    $\rho_\alpha(p) = p\,\rho_\alpha(\alpha_1 p) + p[1 + 2\nu_\alpha(\alpha_1 p)].$
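
Because only one operator is involved, the alpha-model equation for the mean number of trials before the first success, ν(p) = p ν(α₁p) + p, can be unrolled by successive substitution into a series whose k-th term is the probability that the first k trials are all A₁ responses. The following Python sketch sums that series and evaluates the residual of the functional equation. The function names and the truncation length are illustrative assumptions.

```python
def mean_trials_before_first_success(p, alpha1, terms=500):
    """Series solution of nu(p) = p * nu(alpha1 * p) + p.

    The k-th term, p**k * alpha1**(k*(k-1)/2), is the probability that the
    first k trials are all A1 responses. The series is truncated after
    `terms` terms.
    """
    total = 0.0
    term = 1.0            # empty product
    q = p                 # probability of A1 on the current trial
    for _ in range(terms):
        term *= q         # now P(first k trials are all A1)
        total += term
        q *= alpha1       # an A1 response multiplies p by alpha1
    return total


def residual(p, alpha1):
    """Left side minus right side of nu(p) = p * nu(alpha1 * p) + p."""
    lhs = mean_trials_before_first_success(p, alpha1)
    rhs = p * mean_trials_before_first_success(alpha1 * p, alpha1) + p
    return lhs - rhs
```

For α₁ = 1 the series is geometric and sums to p/(1 − p), the mean of a geometric waiting time. For α₁ < 1 the terms fall off much faster, so a few dozen terms already suffice.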
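
Trial number at which last $A_1$ response occurs 

Let 

$L_n = \begin{cases} 0 & \text{if no } A_1 \text{ response occurs on, or after trial } n, \\ 1 & \text{if the last } A_1 \text{ response occurs on trial } n, \\ (N + 1) - n & \text{if the last } A_1 \text{ response occurs on trial } N > n. \end{cases}$

Then the random variable $L_1$ represents the trial number at which the last 
$A_1$ response occurs, and by definition $L_1$ is zero if no $A_1$ response occurs 
on any trial. In the following development, the sequence of responses $A_2A_2A_1$ 
denotes the occurrence of $A_2$ on the first trial followed by $A_2$ on the second 
trial and by $A_1$ on the third trial. It is evident that 

$L_1 = \begin{cases} L_2 + 1 & \text{if } A_1 \text{ occurs}, \\ L_3 + 2 & \text{if } A_2A_1 \text{ occurs}, \\ L_4 + 3 & \text{if } A_2A_2A_1 \text{ occurs}, \\ L_5 + 4 & \text{if } A_2A_2A_2A_1 \text{ occurs}, \end{cases}$

and so on. 

Letting $\mu$ denote the expectation of the random variables $L$, the functional 
equation for $\mu_\alpha$ is developed from Fig. 1. For an infinite number of trials 

$\mu_\alpha(p) = p[\mu_\alpha(\alpha_1 p) + 1] + (1 - p)\alpha_2 p\,[\mu_\alpha(\alpha_1\alpha_2 p) + 2]$
$\qquad + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p\,[\mu_\alpha(\alpha_1\alpha_2^2 p) + 3] + \cdots$
$\quad = p\,\mu_\alpha(\alpha_1 p) + p + (1 - p)\alpha_2 p + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p + \cdots$
$\qquad + [(1 - p)\alpha_2 p + 2(1 - p)(1 - \alpha_2 p)\alpha_2^2 p$
$\qquad + 3(1 - p)(1 - \alpha_2 p)(1 - \alpha_2^2 p)\alpha_2^3 p + \cdots + (1 - p)\alpha_2 p\,\mu_\alpha(\alpha_1\alpha_2 p)$
$\qquad + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p\,\mu_\alpha(\alpha_1\alpha_2^2 p) + \cdots].$

But the term in brackets in the last expression is just $(1 - p)\mu_\alpha(\alpha_2 p)$ as 
may be deduced from the expression for $\mu_\alpha(p)$. Also 

$p + (1 - p)\alpha_2 p + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p + \cdots = 1 - \prod_{i=0}^{\infty} (1 - \alpha_2^i p).$

A similar development for $\mu_\beta(v)$ results in the functional equations 

(15)    $\mu_\beta(v) = \dfrac{v}{1+v}\,\mu_\beta(\beta_1 v) + \dfrac{1}{1+v}\,\mu_\beta(\beta_2 v) + 1 - \prod_{i=0}^{\infty} \dfrac{1}{1 + \beta_2^i v},$

(16)    $\mu_\alpha(p) = p\,\mu_\alpha(\alpha_1 p) + (1 - p)\,\mu_\alpha(\alpha_2 p) + 1 - \prod_{i=0}^{\infty} (1 - \alpha_2^i p),$

with $\mu(0) = 0$, since for $p = 0$ no $A_1$ response ever occurs. A different derivation 
for the mean of the trial number at which the last $A_1$ response occurs in 
the alpha model will be found in Tatsuoka and Mosteller [15]. 

For the expectation of $L_1^2$ it is necessary to consider the expectations of 
$(L_2 + 1)^2$, etc. Denoting the second moment of the random variables $L$ 
by $\gamma$ the functional equations for $\gamma$ are 

(17)    $\gamma_\beta(v) = \dfrac{v}{1+v}\,\gamma_\beta(\beta_1 v) + \dfrac{1}{1+v}\,\gamma_\beta(\beta_2 v) + \Bigl[2\mu_\beta(v) + \prod_{i=0}^{\infty} \dfrac{1}{1 + \beta_2^i v} - 1\Bigr]$

and 

(18)    $\gamma_\alpha(p) = p\,\gamma_\alpha(\alpha_1 p) + (1 - p)\,\gamma_\alpha(\alpha_2 p) + \Bigl[2\mu_\alpha(p) + \prod_{i=0}^{\infty} (1 - \alpha_2^i p) - 1\Bigr],$

with $\gamma(0) = 0$. Functional equations for higher moments of $L_1$ can easily 
be generated in the above manner. 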


Number of runs, of length $j$, of $A_1$ responses 

The sequence of responses 

$\underbrace{A_1 A_1 \cdots A_1 A_1}_{j \text{ trials}}$

is termed a run, of $A_1$ responses, of length $j$. Statistics concerning the number 
of runs of $A_1$ responses of length exactly equal to $j$, and of length greater 
than or equal to $j$ ($j = 1, 2, \cdots$), are of interest. Let $R_{n,j}$ denote the number 
of runs of length $j$, which occur between trial $n$ and the termination of the 
process. The total number of runs of length $j$ is then $R_{1,j}$. From the branching 
process of Fig. 1 it is seen that, when the process begins with exactly $k$ 
consecutive $A_1$ responses, $R_{1,j} = R_{k+2,j} + \delta_{k,j}$, where $\delta_{k,j}$ is the 
Kronecker delta function. Letting $c_j$ denote the expectation of the number 

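of runs of length $j$, the functional equation for $c_{j,\beta}$ is developed from the beta 
model lattice of Fig. 1(a). For an infinite number of trials 

$c_{j,\beta}(v) = \sum_{k=0}^{\infty} \Bigl(\prod_{i=1}^{k} p_i\Bigr)(1 - p_{k+1})\,[c_{j,\beta}(\beta_1^k \beta_2 v) + \delta_{k,j}],$

where $\delta_{k,j}$ is the Kronecker delta function and $p_i = \beta_1^{i-1} v/(1 + \beta_1^{i-1} v)$ is the 
probability of an $A_1$ response following $i - 1$ consecutive $A_1$ responses. 
Substituting $c_{j,\beta}(\beta_1 v)$ for part of the expression gives 

$c_{j,\beta}(v) = p_1 c_{j,\beta}(\beta_1 v) + (1 - p_1)\,c_{j,\beta}(\beta_2 v) + \prod_{i=1}^{j} p_i\,(1 - p_{j+1}) - \prod_{i=1}^{j+1} p_i\,(1 - p_{j+2}).$

A similar development gives the functional equation for $c_{j,\alpha}(p)$. The 
functional equations are 

(19)    $c_{j,\beta}(v) = \dfrac{v}{1+v}\,c_{j,\beta}(\beta_1 v) + \dfrac{1}{1+v}\,c_{j,\beta}(\beta_2 v) + \Bigl(\prod_{i=1}^{j} \dfrac{\beta_1^{i-1} v}{1 + \beta_1^{i-1} v}\Bigr)\Bigl[\dfrac{1}{1 + \beta_1^{j} v} - \dfrac{\beta_1^{j} v}{(1 + \beta_1^{j} v)(1 + \beta_1^{j+1} v)}\Bigr],$

(20)    $c_{j,\alpha}(p) = p\,c_{j,\alpha}(\alpha_1 p) + (1 - p)\,c_{j,\alpha}(\alpha_2 p) + \alpha_1^{j(j-1)/2}\,p^{j}\,(1 - 2\alpha_1^{j} p + \alpha_1^{2j+1} p^{2}),$

with $c_j(0) = 0$. 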


Number of runs of length greater than or equal to $j$ 

Let $T_{n,j}$ stand for the number of runs of length greater than or equal 
to $j$ of $A_1$ responses, which occur from trial $n$ to the termination of the process. 
Then $T_{1,j}$ denotes the total number of such runs. Now, for an infinite number 
of trials 

Ty = Tag Tt Bit C= 25,9) +574); 

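Letting $\lambda_j$ be the expectation of the number of runs of length $\ge j$, a 
development similar to that previously outlined gives 

(21)    $\lambda_{j,\beta}(v) = p_1 \lambda_{j,\beta}(\beta_1 v) + (1 - p_1)\,\lambda_{j,\beta}(\beta_2 v) + (1 - p_{j+1}) \prod_{i=1}^{j} p_i$
$\qquad\qquad = \dfrac{v}{1+v}\,\lambda_{j,\beta}(\beta_1 v) + \dfrac{1}{1+v}\,\lambda_{j,\beta}(\beta_2 v) + \dfrac{1}{1 + \beta_1^{j} v} \prod_{i=1}^{j} \dfrac{\beta_1^{i-1} v}{1 + \beta_1^{i-1} v},$

(22)    $\lambda_{j,\alpha}(p) = p\,\lambda_{j,\alpha}(\alpha_1 p) + (1 - p)\,\lambda_{j,\alpha}(\alpha_2 p) + (1 - \alpha_1^{j} p)\,\alpha_1^{j(j-1)/2}\,p^{j}.$

The expectation of the total number of runs of $A_1$ responses in an infinite 
number of trials is obtained when $j = 1$. Denoting this statistic by $\lambda$, 

(23)    $\lambda_\beta(v) = \dfrac{v}{1+v}\,\lambda_\beta(\beta_1 v) + \dfrac{1}{1+v}\,\lambda_\beta(\beta_2 v) + \dfrac{v}{(1+v)(1 + \beta_1 v)},$

(24)    $\lambda_\alpha(p) = p\,\lambda_\alpha(\alpha_1 p) + (1 - p)\,\lambda_\alpha(\alpha_2 p) + p(1 - \alpha_1 p),$

with $\lambda(0) = 0$. Additional functional equations for other random variables 
of interest, such as runs of $A_2$ responses, have been derived in [11]. 


General Functional Equations for the One-Absorbing-Barrier Models 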


The functional equations presented for statistics of the beta model 
have the general form 

(25)    $f(v, \beta_1, \beta_2) = \dfrac{v}{1+v}\,f(\beta_1 v, \beta_1, \beta_2) + \dfrac{1}{1+v}\,f(\beta_2 v, \beta_1, \beta_2) + g(v, \beta_1, \beta_2),$

where 

$0 \le v < \infty, \qquad 0 < \beta_1 \le 1, \qquad 0 < \beta_2 \le 1.$

The term $g(v, \beta_1, \beta_2)$ is, in general, different for each statistic considered. 
For all except the run statistics 

(26)    $g(v, \beta_1, \beta_2) \ge 0, \qquad g(0, \beta_1, \beta_2) = 0, \qquad \lim_{v \to \infty} g(v, \beta_1, \beta_2) \ge 1.$

For these statistics 

(27)    $f(0, \beta_1, \beta_2) = 0, \qquad \lim_{v \to \infty} f(v, \beta_1, \beta_2) = \infty.$

Equation (26) does not hold for the run statistics and the boundary conditions 
for the run statistics have to be defined separately. 

The functional equations for statistics of the alpha model are seen 
to have the general form 

(28)    $y(p, \alpha_1, \alpha_2) = p\,y(\alpha_1 p, \alpha_1, \alpha_2) + (1 - p)\,y(\alpha_2 p, \alpha_1, \alpha_2) + z(p, \alpha_1, \alpha_2),$

where 

$0 \le p \le 1, \qquad 0 < \alpha_1 \le 1, \qquad 0 < \alpha_2 \le 1.$

For the statistics of the alpha model 

(29)    $z(0, \alpha_1, \alpha_2) = 0, \qquad z(1, \alpha_1, \alpha_2) > 0,$

and the boundary conditions for all the statistics considered are 

(30)    $y(0, \alpha_1, \alpha_2) = 0$

and 

$\lim_{p \to 1} y(p, \alpha_1, \alpha_2)$ is finite. 

The functional equations for the run statistics of the beta model differ 
in nature from the functional equations for the other statistics considered. A 
discussion of the functional equations for the run statistics is presented in [11]. 
The sections which follow present formal solutions to (25) and (28) 
under the boundary conditions (27) and (30) respectively. Theorems concerning 
existence, uniqueness and other properties of the solutions have 
been proved in [11] by methods similar to those of Bellman [3]. Some of these 
theorems are stated here without proof. 

On the functional equation for the OAB beta model 

Writing $f(v, \beta_1, \beta_2)$ simply as $f(v)$, (25) takes the form 

(31)    $f(v) = \dfrac{v}{1+v}\,f(\beta_1 v) + \dfrac{1}{1+v}\,f(\beta_2 v) + g(v),$

where 

$g(v) \ge 0, \qquad g(0) = 0, \qquad \lim_{v \to \infty} g(v) \ge 1,$

and 

$f(0) = 0; \qquad \lim_{v \to \infty} f(v) = \infty.$

Further, let $0 < \beta_1 < 1$; $0 < \beta_2 < 1$. The cases ($\beta_1 = 1$, $\beta_2 < 1$) and ($\beta_1 < 1$, 
$\beta_2 = 1$) can be considered separately. 

Existence of solution. For any function $r(v)$ define the operator $T$ by 

(32)    $T r(v) = \dfrac{v}{1+v}\,r(\beta_1 v) + \dfrac{1}{1+v}\,r(\beta_2 v) + g(v).$

THEOREM 1. 

$f(v) = \lim_{n \to \infty} T^{n} g(v)$

when the limit exists. 

THEOREM 2. If $g(v)$ is a monotone increasing function of $v$, then a solution 
$f(v)$ exists if 

$\sum_{i=0}^{\infty} g(\beta^{i} v)$

is finite for $0 \le v < \infty$, where $0 < \beta = \max(\beta_1, \beta_2) < 1$. 

As almost all the $g(v)$ occurring in the beta model first-moment equations 
are monotone increasing functions of $v$ which satisfy the conditions of Theorem 
2, the existence of the mean of most of the random variables introduced for 
the OAB beta model is assured. 

From a proof similar to that for Theorem 2 it follows that when $g(v)$ 
is a monotone increasing function of $v$, 

(33)    $\sum_{i=0}^{\infty} g(\beta_n^{i} v) \le f(v) \le \sum_{i=0}^{\infty} g(\beta_m^{i} v),$

where 

$\beta_m = \max(\beta_1, \beta_2)$ and $\beta_n = \min(\beta_1, \beta_2).$
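
The bounding series for a monotone increasing g can be checked numerically. The Python sketch below is illustrative only. It evaluates the truncated iterate of the operator T on the lattice of states β₁^m β₂^n v and compares it with the two single-parameter series, using the same number of terms for all three quantities so that the ordering also holds for the truncated sums. The choice g(v) = v/(1 + v), which is the forcing term of the mean-errors equation, and all function names are assumptions made for the example.

```python
def lattice_solution(g, v0, beta1, beta2, depth=300):
    """Truncated solution of
        f(v) = v/(1+v) * f(beta1*v) + 1/(1+v) * f(beta2*v) + g(v),
    obtained by backward recursion over layers m + n = depth, ..., 0."""
    nxt = [0.0] * (depth + 2)
    for s in range(depth, -1, -1):
        cur = []
        for m in range(s + 1):
            v = v0 * beta1 ** m * beta2 ** (s - m)
            q = v / (1.0 + v)             # probability of an A1 response
            cur.append(q * nxt[m + 1] + (1.0 - q) * nxt[m] + g(v))
        nxt = cur
    return nxt[0]


def series_bound(g, v0, beta, terms):
    """Sum of g(beta**i * v0) for i = 0, ..., terms - 1."""
    return sum(g(v0 * beta ** i) for i in range(terms))


def bounds_and_solution(g, v0, beta1, beta2, depth=300):
    """Return (lower bound, lattice solution, upper bound)."""
    lower = series_bound(g, v0, min(beta1, beta2), depth + 1)
    upper = series_bound(g, v0, max(beta1, beta2), depth + 1)
    return lower, lattice_solution(g, v0, beta1, beta2, depth), upper
```

For example, `bounds_and_solution(lambda v: v / (1.0 + v), 2.0, 0.952, 0.647)` returns the three numbers in increasing order. The gap between the bounds widens as the two parameters move apart.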

Continuity. If $|g(v)|$ is bounded in $0 \le v < \infty$, the solution $f(v)$ is 
continuous. 

Monotonicity. If $g(v)$ is a monotone increasing function of $v$, and if 
$\beta_1 > \beta_2$, then $f(v)$ is a monotone increasing function of $v$. 

Uniqueness. The solution $f(v)$ is unique in $0 \le v < \infty$. 


On the functional equation for the OAB alpha models 

For the functional equation 

(34)    $y(p) = p\,y(\alpha_1 p) + (1 - p)\,y(\alpha_2 p) + z(p)$

the development of existence, uniqueness and other properties of the solution 
is similar to that for (31). Some properties of $y(p)$ are stated without proof. 

Existence. For any function $Q(p)$, define the operator 

(35)    $A\,Q(p) = p\,Q(\alpha_1 p) + (1 - p)\,Q(\alpha_2 p) + z(p)$

and let 

$A^{n} Q(p) = A[A^{n-1} Q(p)].$

THEOREM 3. 

$y(p) = \lim_{n \to \infty} A^{n} z(p).$

THEOREM 4. If $z(p)$ is monotone increasing in $p$, then 

$\sum_{i=0}^{\infty} z(\alpha_n^{i} p) \le y(p) \le \sum_{i=0}^{\infty} z(\alpha_m^{i} p),$

where 

$\alpha_m = \max(\alpha_1, \alpha_2), \qquad \alpha_n = \min(\alpha_1, \alpha_2).$

Monotonicity. If $z(p)$ is monotone increasing in $p$, and $\alpha_1 > \alpha_2$, then 
$y(p)$ is monotone increasing in $p$. 

Convexity. If $z(p)$ is convex and $\alpha_1 > \alpha_2$, then $y(p)$ is convex. 


Solution of the Functional Equation for the OAB Beta Model 


The solution to (31) is obtained by generalizing from solutions of the 
equation for special parameter values. The parameter space of the OAB 
beta model is shown in Fig. 2. One solution for special parameter values is 
derived here. A detailed presentation will be found in [11]. 


THEOREM 5. Along sides (1) and (2) of Fig. 2, 


(30) 0) = > > sere Ti EE 

FIGURE 2 

Parameter Space of OAB Beta Model (axes $\beta_1$ and $\beta_2$; sides and arcs labeled (1) through (7)) 


PROOF. Along side (1), $\beta_2 = 0$, $\beta_1 < 1$, and only the $n = 0$ term of the 
summation over $n$ is nonzero. The resulting expression is the one obtained 
from the functional equation, for in this case 

$f(v) = \dfrac{v}{1+v}\,f(\beta_1 v) + g(v),$

giving 

$f(\beta_1^{m} v) = \dfrac{\beta_1^{m} v}{1 + \beta_1^{m} v}\,f(\beta_1^{m+1} v) + g(\beta_1^{m} v),$

for $m = 0, 1, \cdots$, from which the desired result is obtained by successive 
substitution. Along side (2), $\beta_1 = 1$, $\beta_2 < 1$, and (36) becomes 

$f(v) = \sum_{n=0}^{\infty} (1 + \beta_2^{n} v)\,g(\beta_2^{n} v),$

the last expression being also the one obtained from the functional equation 
for the case $\beta_1 = 1$, $\beta_2 < 1$, for which case the functional equation reduces 
to $f(v) - f(\beta_2 v) = (1 + v)\,g(v)$. Q.E.D. 

Note that at the point (1, 1) of the parameter space the solutions diverge. 
By letting $\beta_1 = \beta_2^{k}$, $\beta_2 = \beta_1^{k}$ ($k = 1, 2, \cdots$), solutions along arcs of the form 
(6) and (7) of the parameter space can be obtained. The resulting functional 
equations may be written in the form of q-difference equations for which 
there exists an extensive body of literature [1, 2]. 

Examination of the solutions for various special parameter values suggests 
the form of the general solution. The general solution to (31) is given 
by the following theorem. 


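THEOREM 6.

$$f(v) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(v)\, g(\beta_1^m \beta_2^n v),$$

where

$$A_{0,0}(v) = 1,$$

$$A_{m,0}(v) = \prod_{i=1}^{m} \frac{\beta_1^{i-1} v}{1 + \beta_1^{i-1} v} \qquad (m = 1, 2, \ldots),$$

$$A_{0,n}(v) = \prod_{i=1}^{n} \frac{1}{1 + \beta_2^{i-1} v} \qquad (n = 1, 2, \ldots),$$

$$A_{m,n}(v) = \sum_{k=0}^{n} A_{0,k}(v)\, A_{1,0}(\beta_2^k v)\, A_{m-1,n-k}(\beta_1 \beta_2^k v) \qquad (m, n = 1, 2, \ldots).$$


PROOF. Substitution in (31) gives

$$\sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(v)\, g(\beta_1^m \beta_2^n v) = \frac{v}{1+v} \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(\beta_1 v)\, g(\beta_1^{m+1} \beta_2^n v) + \frac{1}{1+v} \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(\beta_2 v)\, g(\beta_1^m \beta_2^{n+1} v) + g(v).$$

Collecting the terms with $n = 0$, the terms with $m = 0$, and the terms with $m, n \geq 1$ separately gives first

$$\sum_{m=0}^{\infty} A_{m,0}(v)\, g(\beta_1^m v) = g(v) + \frac{v}{1+v} \sum_{m=1}^{\infty} A_{m-1,0}(\beta_1 v)\, g(\beta_1^m v), \tag{37}$$

which gives

$$A_{0,0}(v) = 1; \qquad A_{m,0}(v) = \frac{v}{1+v}\, A_{m-1,0}(\beta_1 v) = \prod_{i=1}^{m} \frac{\beta_1^{i-1} v}{1 + \beta_1^{i-1} v};$$

$$\sum_{n=1}^{\infty} A_{0,n}(v)\, g(\beta_2^n v) = \frac{1}{1+v} \sum_{n=1}^{\infty} A_{0,n-1}(\beta_2 v)\, g(\beta_2^n v), \tag{38}$$

which gives

$$A_{0,1}(v) = \frac{1}{1+v}\, A_{0,0}(\beta_2 v) = \frac{1}{1+v},$$

$$A_{0,n}(v) = \frac{1}{1+v}\, A_{0,n-1}(\beta_2 v) = \prod_{i=1}^{n} \frac{1}{1 + \beta_2^{i-1} v},$$

and

$$\sum_{m=1}^{\infty} \sum_{n=1}^{\infty} A_{m,n}(v)\, g(\beta_1^m \beta_2^n v) = \frac{v}{1+v} \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} A_{m-1,n}(\beta_1 v)\, g(\beta_1^m \beta_2^n v) + \frac{1}{1+v} \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} A_{m,n-1}(\beta_2 v)\, g(\beta_1^m \beta_2^n v). \tag{39}$$

The coefficients in this last expression satisfy the difference equation

$$A_{m,n}(v) = \frac{v}{1+v}\, A_{m-1,n}(\beta_1 v) + \frac{1}{1+v}\, A_{m,n-1}(\beta_2 v),$$

from which follows [11]

$$A_{1,n}(v) = \sum_{k=0}^{n} A_{0,k}(v)\, A_{1,0}(\beta_2^k v)\, A_{0,n-k}(\beta_1 \beta_2^k v) = \sum_{k=0}^{n} \frac{\beta_2^k v}{1 + \beta_2^k v} \prod_{i=1}^{k} \frac{1}{1 + \beta_2^{i-1} v} \prod_{j=1}^{n-k} \frac{1}{1 + \beta_1 \beta_2^{k+j-1} v},$$

$$A_{m,n}(v) = \sum_{k=0}^{n} A_{0,k}(v)\, A_{1,0}(\beta_2^k v)\, A_{m-1,n-k}(\beta_1 \beta_2^k v). \qquad \text{Q.E.D.}$$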
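
As a numerical sanity check on the recursion used in the proof, the following Python sketch evaluates the coefficients and tests the difference equation. It assumes that (31) has the form $f(v) = \frac{v}{1+v} f(\beta_1 v) + \frac{1}{1+v} f(\beta_2 v) + g(v)$, so that $A_{m,0}(v) = \frac{v}{1+v} A_{m-1,0}(\beta_1 v)$, $A_{0,n}(v) = \frac{1}{1+v} A_{0,n-1}(\beta_2 v)$, and $A_{m,n}(v) = \sum_{k=0}^{n} A_{0,k}(v) A_{1,0}(\beta_2^k v) A_{m-1,n-k}(\beta_1 \beta_2^k v)$. The function names `make_coefficients` and `difference_equation_residual` are illustrative and do not come from the paper.

```python
from functools import lru_cache


def make_coefficients(beta1, beta2):
    """Build A(m, n, v) for the coefficients of the double-series solution.

    Assumption: the coefficient formulas are those stated in the lead-in.
    """

    @lru_cache(maxsize=None)
    def A(m, n, v):
        if m == 0 and n == 0:
            return 1.0
        if n == 0:
            # A_{m,0}(v) = v/(1+v) * A_{m-1,0}(beta1 * v)
            return v / (1.0 + v) * A(m - 1, 0, beta1 * v)
        if m == 0:
            # A_{0,n}(v) = 1/(1+v) * A_{0,n-1}(beta2 * v)
            return A(0, n - 1, beta2 * v) / (1.0 + v)
        # Sum over k for m, n >= 1.
        total = 0.0
        for k in range(n + 1):
            w = beta2 ** k * v
            total += A(0, k, v) * A(1, 0, w) * A(m - 1, n - k, beta1 * w)
        return total

    return A


def difference_equation_residual(A, beta1, beta2, m, n, v):
    """Return |A_{m,n}(v) - [v A_{m-1,n}(b1 v) + A_{m,n-1}(b2 v)] / (1+v)|."""
    rhs = (v * A(m - 1, n, beta1 * v) + A(m, n - 1, beta2 * v)) / (1.0 + v)
    return abs(A(m, n, v) - rhs)
```

For example, with `A = make_coefficients(0.6, 0.8)`, calling `difference_equation_residual(A, 0.6, 0.8, 3, 2, 1.0)` should return a value at rounding-error level if the sum formula and the difference equation agree.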


General Solution to the OAB Alpha Model Functional Equation


Replacing $\beta_2$ and $\beta_1$ by $\alpha_2$ and $\alpha_1$ in Fig. 2 gives the parameter space of
the OAB alpha model. The general solution for (34) can be derived [11] in
a manner similar to that used for the beta model functional equation. The
solution is given by


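THEOREM 7.

$$y(p) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} b_{m,n}(p)\, z(\alpha_1^m \alpha_2^n p),$$

where

$$b_{0,0}(p) = 1,$$

$$b_{m,0}(p) = p^m \alpha_1^{m(m-1)/2} \qquad (m = 1, 2, \ldots),$$

$$b_{0,n}(p) = \prod_{i=1}^{n} (1 - \alpha_2^{i-1} p) \qquad (n = 1, 2, \ldots),$$

$$b_{m,n}(p) = \sum_{k=0}^{n} b_{1,0}(\alpha_2^k p)\, b_{0,k}(p)\, b_{m-1,n-k}(\alpha_1 \alpha_2^k p) \qquad (m, n = 1, 2, \ldots).$$

PROOF. The proof is similar to that used for the beta model equation.
Details are given in [11].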


Discussion 

Analytical techniques applicable to a class of learning models have 
been presented. Functional equations for various statistics of two learning 
models, viz., Luce's nonlinear beta model and a linear commuting-operator
model called the alpha model, have been derived from the branching processes 
defined by the models. 

The results on stochastic properties of Luce's beta model are new. For
the alpha model, power series solutions to the functional equations for the
first and second moments of the total number of $A_1$ responses and the trial
number at which the last $A_1$ occurs had been obtained by Tatsuoka and
Mosteller [15]. However, the technique of expanding the functions in a
power series in the variable often fails, as is illustrated by the fact that the
power series solutions (obtained by Tatsuoka [14]) to the functional equations
for the first and second moments of the total number of $A_1$ responses for
the one-absorbing-barrier (OAB) beta model are not valid for $v > 1$.

By investigating two general equations, the problem of solving the 
individual functional equations for the OAB models was simplified. The 
functional equations for the sequential statistics of the OAB beta model 
do not have the same boundary conditions as the general equation presented 
in this paper, and their solutions require additional investigation.

Because of the complexity of the expressions obtained for the statistics 
of the OAB models, an attempt was made to find some close bounds which 
could be easily computed. Some upper and lower bounds for statistics of 
the OAB alpha model have been presented in ([11], ch. 5). An upper bound 
for one statistic of the OAB beta model has also been derived in [11], mainly 
to illustrate the methods used to obtain upper bounds for a few statistics 
of the OAB beta model. These methods failed for a number of the statistics. 
Furthermore, a method for the derivation of close lower bounds for the 
OAB beta model remains to be found. 

Empirical tests and comparisons of the beta model with other models
have been presented by Bush, Galanter, and Luce [6] and Fey [10]. The 
use of statistics such as those derived in this paper for the estimation of 
parameters and for measuring the goodness of fit has been discussed by 
Bush and Mosteller [8], Bush, Galanter, and Luce [6] and by others (see [5]).
THE ASYMPTOTIC DISTRIBUTION FOR THE TWO-ABSORBING-BARRIER
BETA MODEL*


LAVEEN KANAL†
GENERAL DYNAMICS/ELECTRONICS 
ROCHESTER, NEW YORK 


For the two-absorbing-barrier specialization of Luce’s beta learning 
model, the asymptotic distribution of the response probability has all its den- 
sity at p = 0 and p = 1. The functional equation for the amount of the 
density at p = 1 is investigated in this paper.


Luce's beta learning model [5] for the two-response, two-event, contingent
case is given by the transition equations

$$v_{n+1} = \begin{cases} \beta_1 v_n & \text{with probability } p_n, \\ \beta_2 v_n & \text{with probability } 1 - p_n, \end{cases} \tag{1}$$

where $p_n$ and $1 - p_n$ are respectively the probabilities of response $A_1$ and
response $A_2$, and where $v_n = p_n/(1 - p_n)$. In a companion paper [3] statistics
for the one-absorbing-barrier (OAB) beta model obtained when $\beta_1 < 1$,
$\beta_2 < 1$, are derived. In this paper a statistic for the two-absorbing-barrier
(TAB) beta model arising when $\beta_1 > 1$, $\beta_2 < 1$ is presented. Some statistics
for the two-reflecting-barrier beta model are considered in [4].

For the two-absorbing-barrier beta model the asymptotic distribution
of $p_n$ has all its density at $p = 0$ and $p = 1$. The amount of the density at
$p = 1$ is a useful statistic for these models. If $f(v)$ is the probability that a
"particle" starting at $v$ is eventually absorbed at $v = +\infty$, i.e., at $p = 1$, the
functional equation for $f(v)$ is

$$f(v) = \frac{v}{1+v}\, f(\beta_1 v) + \frac{1}{1+v}\, f(\beta_2 v), \tag{2}$$


$\beta_1 > 1$, $\beta_2 < 1$, $f(0) = 0$, and $f(\infty) = 1$.


*Abstracted from a portion of the author's doctoral dissertation, University of
Pennsylvania. The author is indebted to Prof. ..., dissertation supervisor, for the
valuable help and encouragement received.
†Formerly at the Moore School of Electrical Engineering, University of Pennsylvania,
Philadelphia, Pa. The author is grateful to the Moore School for the support
extended to him during his doctoral studies. He also wishes to thank ... and
N. Finkelstein of General Dynamics for their ...


This article appeared in Psychometrika, 1962, 27, 105-109. Reprinted with permission.
The solution of (2) is the subject of this paper. Existence, uniqueness, and
monotonicity of the solution are shown in [4] by methods similar to those 
of Bellman [1]. 

Solution for the symmetric model 


For the two-absorbing-barrier symmetric beta model

$$\beta_1 = \frac{1}{\beta_2} > 1.$$

Let $x = \log_e v$ and $b = \log_e \beta_1$. Then (2) becomes

$$f(x) = \frac{e^x}{1 + e^x}\, f(x + b) + \frac{1}{1 + e^x}\, f(x - b). \tag{3}$$


The solution of (3) is given by Theorem 1. 

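THEOREM 1*.

$$f(x) = \frac{\sum_{k=0}^{\infty} \exp\left\{-\frac{1}{2b}\left[x - \left(k + \tfrac{1}{2}\right) b\right]^2\right\}}{\sum_{k=-\infty}^{\infty} \exp\left\{-\frac{1}{2b}\left[x - \left(k + \tfrac{1}{2}\right) b\right]^2\right\}}.$$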

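PROOF. From (3), letting $g(x) = f(x) - f(x - b)$, $h(x) = \log_e g(x)$,
one gets $h(x) - h(x + b) = x$. Assuming $h(x) = c_0 + c_1 x + c_2 x^2$, and substituting
gives

$$g(x) = p(x) \exp\left\{-\frac{1}{2b}\left(x - \frac{b}{2}\right)^2\right\},$$

where $p(x)$ is a periodic function of period $b$. As

$$f(x + nb) - f(x) = g(x + b) + g(x + 2b) + \cdots + g(x + nb),$$

then as $n \to \infty$, $f(x + nb) \to 1$ and

$$f(x) = 1 - p(x) \sum_{k=1}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + \left(k - \tfrac{1}{2}\right) b\right]^2\right\}.$$

Furthermore

$$f(x + nb) = 1 - p(x) \sum_{k=n+1}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + \left(k - \tfrac{1}{2}\right) b\right]^2\right\},$$

and letting $n \to -\infty$ gives

$$p(x) = \frac{1}{\sum_{k=-\infty}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + \left(k - \tfrac{1}{2}\right) b\right]^2\right\}},$$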


*Prof. B. Epstein pointed out the error in taking limits in an earlier version of 
Theorem 1 presented by Bush [2]. 
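so that

$$f(x) = 1 - \frac{\sum_{k=1}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + \left(k - \tfrac{1}{2}\right) b\right]^2\right\}}{\sum_{k=-\infty}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + \left(k - \tfrac{1}{2}\right) b\right]^2\right\}},$$

from which Theorem 1 follows. Note that $f(0) = \frac{1}{2}$ as the symmetry of the
problem indicates. Q.E.D.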
COROLLARY 1. When $x \to -\infty$, i.e., for large negative values of $x$,

$$f(x) \approx p(x) \exp\left\{-\frac{1}{2b}\left(x - \frac{b}{2}\right)^2\right\},$$

as $p(x)$ is of period $b$ and the term corresponding to $k = 0$ dominates in the
numerator.


COROLLARY 2. For large positive $x$,

$$f(x) \approx 1 - p(x) \exp\left\{-\frac{1}{2b}\left(x + \frac{b}{2}\right)^2\right\}.$$
COROLLARY 3. When $b < 4$ the denominator of Theorem 1 is closely approximated by

$$\sqrt{\frac{2\pi}{b}},$$

obtained by performing a Fourier series analysis.


COROLLARY 4. When $b < 4$,

$$f(x) \approx \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x/\sqrt{b}} e^{-u^2/2}\, du,$$

for then by Corollary 3, the denominator of Theorem 1 is closely approximated
by a constant and the numerator may be approximated by replacing
the sum from zero to infinity by an integral from $-1/2$ to infinity. Using
the transformation

$$u = \frac{x - \left(k + \tfrac{1}{2}\right) b}{\sqrt{b}}$$

gives the corollary.
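
The series solution and the normal approximation can be compared numerically. The Python sketch below assumes the solution of the symmetric model has the form $f(x) = \sum_{k \ge 0} E_k / \sum_{k=-\infty}^{\infty} E_k$ with $E_k = \exp\{-[x - (k + \tfrac{1}{2}) b]^2 / (2b)\}$, and that equation (3) reads $f(x) = [e^x f(x+b) + f(x-b)]/(1 + e^x)$. The function names and the truncation parameter `terms` are illustrative choices, not part of the paper.

```python
import math


def symmetric_tab_solution(x, b, terms=200):
    """Truncated series for f(x); assumes the form stated in the lead-in."""
    def summand(k):
        return math.exp(-((x - (k + 0.5) * b) ** 2) / (2.0 * b))

    numerator = sum(summand(k) for k in range(0, terms))
    denominator = sum(summand(k) for k in range(-terms, terms))
    return numerator / denominator


def normal_approximation(x, b):
    """Standard normal CDF evaluated at x / sqrt(b), as in Corollary 4."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0 * b)))


def equation3_residual(x, b):
    """Return |f(x) - [e^x f(x+b) + f(x-b)] / (1 + e^x)| for the series."""
    ex = math.exp(x)
    rhs = (ex * symmetric_tab_solution(x + b, b)
           + symmetric_tab_solution(x - b, b)) / (1.0 + ex)
    return abs(symmetric_tab_solution(x, b) - rhs)
```

With $b = 0.5$, for instance, `symmetric_tab_solution(0.0, 0.5)` should be very close to one half, and the series and `normal_approximation` should differ only in the third decimal place for moderate $x$.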
Solution for the general TAB beta model 


For the general case $\beta_1 > 1$, $\beta_2 < 1$, it is convenient to obtain the solution
in terms of the solution for the symmetric model. Let the solution
for the symmetric model given in Theorem 1 be denoted by $R(v)$. Then the
solution for the general model is given by Theorem 2.
THEOREM 2. For $\beta_1 > 1$, $\beta_2 < 1$,


{t= XY XX BAB IBABIBT RET BEY, 
(Bi) a6 220. Fab 


where 
C6) = Ta - s), 
2 EA 2 
Ha-s 
3) ete 


I (1-8;i'") 


B..(B:) 


’ 


B.(6i') = 


PROOF. Define the transform
F(s) = Le fv" do. 
Writing (2) in the form
$$(1 + v)\, f(v) = v\, f(\beta_1 v) + f(\beta_2 v),$$


and applying the transform gives 
FG) + F(s + 1) = BiF(s) + BF SF 1) 
If $R(s)$ is the transform of $R(v)$, it is shown in [4] that


Eo) Ls (a+1) 
FO) = RO II [ুল্ল্] ; 


Fl) 


from which, by expanding the numerator and denominator terms in the 
product, one gets 


FO) = সন 2 B.(87) 3 Bre XD Bu(B)BT BIRO). 


The inverse transform of 6B;**8:"R(s) being R(B; *Bzv), taking the inverse 


transform of $F(s)$ gives Theorem 2. Q.E.D.
It is noted that the coefficients in the series of Theorem 2 tend to zero
rather rapidly.
SOME RANDOM WALKS ARISING IN
LEARNING MODELS I


SAMUEL KARLIN 


Introduction 


The present paper presents an analysis of certain transition operators arising in
some learning models introduced by Bush and Mosteller [2]. They suppose that the
organism makes a sequence of responses among a fixed finite set of alternatives and
there is a probability $p_s^n$ at moment $n$ that response $s$ will occur. They suppose further
that the probabilities $p_s^{n+1}$ are determined by the $p_s^n$, the response $s_n$ made after moment
$n$, and the outcome or event $r_n$ that follows response $s_n$. We shall examine in detail the
one-dimensional models which occur in their theory. These models can be described
in simplest form as follows: There exist two alternatives $A_1$ and $A_2$, and two possible
outcomes $r_1$ and $r_2$, for each experiment. There exists a set of Markoff matrices $F_{ij}$
which will apply where choice $i$ was made and outcome $r_j$ occurs. Let $p$ represent the
initial probability of choosing alternative $A_2$, and $1 - p$ the probability of choosing
$A_1$. Depending on the choice and outcome, the vector $(p, 1 - p)$ is transformed by the
appropriate $F_{ij}$ into a new probability vector which represents the new probabilities
of preference of $A_2$ and $A_1$, respectively, by the organism. The psychologist is interested
in knowing the limiting form of the probability choice vector $(p, 1 - p)$.

The mathematical description of the simplest process of this type can be formulated
as follows: A particle on the unit interval executes a random walk subject to
two impulses. If it is located at the point $x$, then $x \to F_1 x = \sigma x$ with probability
$1 - \phi(x)$, and $x \to F_2 x = 1 - \alpha + \alpha x$ with probability $\phi(x)$. The actual limiting
behavior of $x$ depends on the nature of $\phi(x)$. The transition operator representing the
change of the distribution describing the position of the particle is given by

$$(TF)(x) = \int_0^{x/\sigma} [1 - \phi(t)]\, dF(t) + \int_0^{(x - 1 + \alpha)/\alpha} \phi(t)\, dF(t).$$

We introduce an additional operator, acting on continuous functions, and
given by

$$U\pi(t) = [1 - \phi(t)]\, \pi(\sigma t) + \phi(t)\, \pi(1 - \alpha + \alpha t).$$

It turns out that $T$ is conjugate to $U$; hence knowing the behavior of $U$ one obtains
much information about $T$. This interplay shall be exploited considerably. The
operator $T$ is not weakly completely continuous nor does it possess any kind of compactness
property; thus none of the classical ergodic theorems apply to this type.
The limiting behavior of $T^n F$ depends very sensitively on the assumptions made
about the operators $F_i$ and the probabilities $\phi(x)$.


This article is from Pacific J. Math., 1953, 3, 725-756. Reprinted with permission.
Section 1 treats the case where $\phi(x) = x$. This causes the boundaries 0 and 1
to be absorbing states, and thus the limiting distribution concentrates only at these
points. However, the concentration depends on the initial distribution. By examining
the corresponding $U$ in detail, we have been able to obtain much additional knowledge.
For example, we have shown that if $\pi$ is $m$ times continuously differentiable then
$(U^n \pi)^{(r)}$ converges uniformly for each $0 \leq r \leq m - 1$. It is worth emphasizing that
the knowledge of the convergence of the distributions does not imply the uniform
convergence of $U^n \pi$ for any continuous function $\pi$. Additional arguments are needed
for this conclusion. In this connection, we finally remark that R. Bellman, T. Harris,
and H. N. Shapiro [1] have analyzed only this case independently. They did not point
out the connection between the operators $T$ and $U$. The methods they used to establish
the convergence of $T^n F$ are probabilistic. Our paper in § 1 overlaps with theirs in some
of the theorems, notably 6, 8, 9, 12, and 15; our results subsume theirs, and their
proofs are entirely different from ours. Section 2 considers the case where $\phi(x)$ is
monotone increasing and

$$|\phi(x) - \phi(y)| \leq \mu < 1.$$

This leads to the ergodic phenomenon, or steady-state situation, where the limiting
distributions are independent of the starting distributions.

In § 3, we examine the situation $\phi(x) = 1 - x$. This corresponds to completely
reflecting boundaries, and of course the ergodic phenomenon holds. Other interesting
properties of the operators are also developed. We consider in § 4 the case where $\phi(x)$
is linear and monotonic decreasing. Section 5 introduces a further possibility where
we allow the particle to stand still with certain probability. This type has been statistically
examined by M. M. Flood [5]. In § 6 we investigate the general ergodic type
where $\phi(x)$ is not necessarily linear. The arguments here combine both abstract analysis
and probabilistic reasoning involving recurrent event theory. Furthermore, it is worth
emphasizing, the proofs given in § 6 apply without any modifications to the case where
we allow any finite number of impulses acting on the particle. In a future paper we
shall present the extension of this model to the circumstance where changes in time
occur continuously and the possible motion of the particle has a continuous or infinite
discrete range of values.

The last section studies some of the properties of the limiting distribution in the
ergodic types. It is shown in all circumstances that the limiting distribution is either
singular or absolutely continuous, and the actual form depends on the value of $\alpha + \sigma$.

Most of the analysis carries over to higher dimensional models where more
alternatives are allowed. In a subsequent paper we shall present this theory with other
generalizations. We finally note that this paper represents a combination of abstract
analysis and probability; it is hoped that the methods used will be useful for future
investigations of this type.


It has been brought to my attention by the referee that the material of [6], [7],
[8], and [9] relates closely to the content of this paper. Their techniques seem to be
different.


1. A particle undergoes a random walk on the unit interval subject to the following
law: If the particle is at $x$, then after unit time $x \to \alpha + (1 - \alpha) x$ with probability
$x$, and $x \to \sigma x$ with probability $1 - x$, where $0 < \alpha, \sigma < 1$. If $F(x)$ represents the
cumulative distribution describing the location of $x$ at the beginning of the time interval
with the understanding that $F(x) = 1$ for $x \geq 1$ and $F(x) = 0$ for $x < 0$, then the new
distribution locating the position of the particle at the end of the time interval is given
by

$$G(x) = TF = \int_0^{x/\sigma} (1 - t)\, dF(t) + \int_0^{(x - \alpha)/(1 - \alpha)} t\, dF(t). \tag{1}$$

Indeed, the probability $dG(x)$ that after unit time the particle is located at $x$
can materialize in two ways: namely, the particle was at $x/\sigma$ and moved with probability
$1 - x/\sigma$ to $x$, or it jumped with probability $(x - \alpha)/(1 - \alpha)$ from $(x - \alpha)/(1 - \alpha)$
to $x$ during the unit time interval. This yields

$$dG(x) = \left(1 - \frac{x}{\sigma}\right) dF\!\left(\frac{x}{\sigma}\right) + \frac{x - \alpha}{1 - \alpha}\, dF\!\left(\frac{x - \alpha}{1 - \alpha}\right),$$

which easily implies the conclusion of equation (1).
Equation (1) represents the transition law for the particular Markoff process on
hand.
The transformation $T$ is easily seen to furnish a linear bounded mapping of the
space of functions of bounded variation ($V$) on the unit interval into itself. Furthermore,
$T$ takes distributions into distributions and is of norm 1. This section
investigates the behavior of $T^n$ for large $n$ with the aim of determining limiting
properties of $T^n$.
We consider the following additional mapping $U$ applied to the space of continuous
functions defined on the unit interval (0, 1):

$$(U\pi)(t) = (1 - t)\, \pi(\sigma t) + t\, \pi[\alpha + (1 - \alpha) t]. \tag{2}$$

The operator $U$ has a probabilistic interpretation which we shall speak about later;
but its direct relevance to $T$ is given in Theorem 1. The inner-product notation

$$(\pi, F) = \int_0^1 \pi(t)\, dF(t)$$

will be extensively used.


THEOREM 1. The conjugate map $U^*$ to $U$ is $T$.


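PROOF. It is necessary to verify that $(U\pi, F) = (\pi, TF)$ for any continuous
function $\pi(t)$ and any distribution $F(t)$ with $F(t) = 1$ for $t \geq 1$ and $F(t) = 0$ for $t < 0$.
Indeed,

$$(U\pi, F) = \int_0^1 (1 - t)\, \pi(\sigma t)\, dF(t) + \int_0^1 t\, \pi[\alpha + (1 - \alpha) t]\, dF(t).$$

By a change of variable, we get

$$(U\pi, F) = \int_0^{\sigma} \left(1 - \frac{t}{\sigma}\right) \pi(t)\, dF\!\left(\frac{t}{\sigma}\right) + \int_{\alpha}^{1} \frac{t - \alpha}{1 - \alpha}\, \pi(t)\, dF\!\left(\frac{t - \alpha}{1 - \alpha}\right) = \int_0^1 \pi(t)\, dG(t) = (\pi, G),$$

where $G = TF$.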
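
The duality between $U$ and $T$ can be checked directly for distributions with finitely many atoms. The Python sketch below assumes the walk of this section: from $x$ the particle moves to $\alpha + (1 - \alpha)x$ with probability $x$ and to $\sigma x$ with probability $1 - x$. A discrete distribution is represented as a list of `(position, weight)` pairs; this representation and the function names are illustrative choices, not taken from the paper.

```python
def apply_U(pi, sigma, alpha):
    """Return the function U(pi), with (U pi)(t) = (1-t) pi(sigma t) + t pi(alpha + (1-alpha) t)."""
    def U_pi(t):
        return (1.0 - t) * pi(sigma * t) + t * pi(alpha + (1.0 - alpha) * t)
    return U_pi


def apply_T(atoms, sigma, alpha):
    """Advance a discrete distribution [(x, weight), ...] by one step of the walk."""
    new_atoms = []
    for x, w in atoms:
        new_atoms.append((sigma * x, w * (1.0 - x)))           # shrink toward 0
        new_atoms.append((alpha + (1.0 - alpha) * x, w * x))   # jump toward 1
    return new_atoms


def inner(pi, atoms):
    """Inner product (pi, F) for a discrete distribution F."""
    return sum(w * pi(x) for x, w in atoms)
```

For any test function `pi` and any list of atoms, `inner(apply_U(pi, sigma, alpha), atoms)` and `inner(pi, apply_T(atoms, sigma, alpha))` should agree up to rounding, which is the statement $(U\pi, F) = (\pi, TF)$ restricted to discrete $F$.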


The value of Theorem 1 is that, by studying the iterates $U^n$, we deduce
corresponding results about the conjugate operators $T^n$. We proceed now to study
this operator $U$. To be complete, we should denote the operator by $U_{\sigma,\alpha}$ but where no
ambiguity arises we shall drop the subscripts. Let $W$ denote the isometry

$$W\pi(t) = \pi(1 - t).$$

Clearly $W^{-1} = W$. We now observe the identity

$$W U_{\sigma,\alpha} W = U_{1-\alpha,\,1-\sigma}. \tag{3}$$

The mapping $(\sigma, \alpha) \to (1 - \alpha, 1 - \sigma)$ of the parameter space into itself has the effect
of mapping the triangle of the unit square bounded above by $1 - \alpha - \sigma = 0$ into the
other triangle located in the unit square. This isomorphism property enables us to
restrict our attention to the case where $1 - \alpha - \sigma > 0$. Corresponding theorems valid
for the other circumstances, where $1 - \alpha - \sigma < 0$, are deduced easily by virtue of (3)
and will be summarized at the end of this section. From now on in § 1, unless explicitly
stated otherwise, we shall assume that $1 - \alpha - \sigma > 0$.

The next two theorems, which we state for completeness, are immediate from (2).


THEOREM 2. The operator $U$ preserves the values at 0 and 1.


THEOREM 3. The operator $U$ is positive; that is, it transforms positive continuous
functions into positive continuous functions.


In particular, if $\pi_1(t) \geq \pi_2(t)$ for all $t$, then $U\pi_1 \geq U\pi_2$.


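THEOREM 4. If $\pi, \pi', \ldots, \pi^{(n)} \geq 0$, then $U\pi, (U\pi)', \ldots, (U\pi)^{(n)} \geq 0$.


PROOF. A simple calculation yields

$$(U\pi)^{(n)}(t) = (1 - t)\, \sigma^n \pi^{(n)}(\sigma t) + t\, (1 - \alpha)^n \pi^{(n)}(\alpha + (1 - \alpha) t) + n \left[ (1 - \alpha)^{n-1} \pi^{(n-1)}(\alpha + (1 - \alpha) t) - \sigma^{n-1} \pi^{(n-1)}(\sigma t) \right]. \tag{4}$$

Since

$$\alpha + (1 - \alpha) t \geq \sigma t,$$

we conclude since $\pi^{(n-1)}(t)$ is monotonic increasing that

$$\pi^{(n-1)}(\alpha + (1 - \alpha) t) \geq \pi^{(n-1)}(\sigma t) \geq 0.$$

The assumption that $1 - \alpha \geq \sigma$ implies that $(1 - \alpha)^{n-1} \geq \sigma^{n-1}$.
As $\pi^{(n)}(t) \geq 0$, it follows that $(U\pi)^{(n)} \geq 0$. The
same conclusion and argument apply to $(U\pi)^{(r)}$ for $0 \leq r \leq n - 1$.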


In particular, U transforms positive monotonic convex functions into functions 
of the same kind. Although in the proof of Theorem 4 we assumed the existence of 
derivatives, the argument can be carried through routinely at the expense of elegance, 
by use of the general definitions of convexity and monotonicity. 


THEOREM 5. If $c \geq \pi^{(i)}(t) \geq 0$ for $0 \leq i \leq n$, then $(U^k \pi)^{(i)}(1) \leq K_i$ for
$0 \leq i \leq n$ and hence $(U^k \pi)^{(i)}(t) \leq K_i$.


PROOF. The proof is by induction. By Theorem 2, the theorem is trivially
true for $i = 0$. Suppose we have established the result for the $i$th derivative with
$0 \leq i \leq n - 1$. Equation (4) yields

$$(U\pi)^{(n)}(1) - \pi^{(n)}(1) = c_1(\alpha)\, \pi^{(n-1)}(1) - c_2(\sigma)\, \pi^{(n-1)}(\sigma) + \left[ (1 - \alpha)^n - 1 \right] \pi^{(n)}(1), \tag{5}$$

where $c_1(\alpha)$ and $c_2(\sigma)$ are constants depending only on $\alpha$ and $\sigma$ respectively, and on $n$.
If

$$\pi^{(n)}(1) \geq M(\alpha, \sigma, c),$$

where $M$ is a constant sufficiently large, then (5) yields

$$(U\pi)^{(n)}(1) \leq \pi^{(n)}(1).$$

Since $c_1(\alpha)$ and $c_2(\sigma)$ do not depend on $k$, and by the induction hypotheses

$$(U^k \pi)^{(n-1)}(t) \leq K_{n-1}$$

uniformly in $k$ and $t$, we find in general that when $(U^k \pi)^{(n)}(1)$ becomes larger than
$M(\alpha, \sigma, c)$, then

$$(U^{k+1} \pi)^{(n)}(1) \leq (U^k \pi)^{(n)}(1).$$

Consequently, the iterates $(U^k \pi)^{(n)}(1)$ for $k \geq k_0$ are bounded by

$$M(\alpha, \sigma, c) + c_1(\alpha) M + c_2(\sigma) M.$$

This trivially implies the conclusion of Theorem 5.

The proof of the next theorem is due originally to R. Bellman. We present 
it for completeness. 

THEOREM 6. There exists at most one continuous solution of $U\pi = \pi$ for which
$\pi(0) = 0$ and $\pi(1) = 1$.

PROOF. (By contradiction.) Let $\pi_1$ and $\pi_2$ denote two solutions with the prescribed
boundary conditions. Put $\pi_0 = \pi_1 - \pi_2$; then $\pi_0(0) = \pi_0(1) = 0$. Let $t_0$
be a point where $\pi_0$ achieves its maximum. Since

$$\pi_0(t_0) = (1 - t_0)\, \pi_0(\sigma t_0) + t_0\, \pi_0(\alpha + (1 - \alpha) t_0),$$

we deduce that $\sigma t_0$ is also a maximum point. Iterating, we find by continuity that
$\pi_0(0) = 0$ is the maximum value of $\pi_0(t)$. A similar argument shows that $0 = \min \pi_0(t)$,
which implies that $\pi_1 = \pi_2$.

THEOREM 7. For any function $\pi(t) = t^r$ with $\infty > r \geq 1$, $U^n \pi(t)$ converges
uniformly as $n \to \infty$.


PROOF. Clearly $t \geq t^r \geq p(t)$, where

$$p(t) = \begin{cases} 0 & \text{for } 0 \leq t \leq t_0, \\ \dfrac{t - t_0}{1 - t_0} & \text{for } t_0 \leq t \leq 1, \end{cases}$$

and $t_0$ is close to 1 with $r$ fixed. Since $Ut$ is convex by Theorem 4, and the values at
0 and 1 are fixed, we find that $t \geq Ut$. Hence

$$t \geq Ut \geq U^2 t \geq \cdots$$

and $\lim U^n t = \psi(t)$ for every $t$. Since $\psi(t)$ is convex, and by Theorem 5 the derivatives
of $U^n t$ at 1 are uniformly bounded, we conclude that $\psi(t)$ is continuous. By Dini's
theorem the convergence of $U^n t$ to $\psi(t)$ is uniform. Obviously, $U\psi = \psi$. On the other
hand, if $t_0$ is close to 1 then $(Up)'(1) \leq p'(1)$ (see the proof of Theorem 5). Since
Theorem 4 guarantees the convexity of $Up$, and the slope at 0 is 0, it follows that
$Up \geq p$, and hence $U^{n+1} p \geq U^n p$; therefore $\lim U^n p = \phi(t)$. Again, $\phi(t)$ is a continuous
fixed point, and therefore by Theorem 6 we infer that $\phi(t) = \psi(t)$. On account
of $U^n t \geq U^n t^r \geq U^n p$, we deduce that $\lim U^n t^r = \phi(t)$ with the convergence being
uniform.

We denote this unique fixed point of $U$ by $\phi_{\sigma,\alpha}(t)$, or by $\phi(t)$ whenever no ambiguity
arises.


THEOREM 8. The iterates $U^n$ converge strongly (that is, $U^n \pi$ converges uniformly
for any continuous function $\pi$).


PROOF. The constant functions are fixed points of $U^n$. Consequently by
Theorem 7, $U^n g$ converges uniformly for any function $g(t)$ in the linear space $L$ spanned
by the functions $(1, t^r)$. The set $L$ is dense in the space of continuous functions. Moreover,
as $\|U^n\| = 1$, by a well-known theorem of Banach, $U^n g$ converges strongly
when applied to any continuous function $g(t)$.

The actual limit is easily seen to be given by

$$\lim_{n \to \infty} U^n g(t) = g(1)\, \phi_{\sigma,\alpha}(t) + g(0)\, [1 - \phi_{\sigma,\alpha}(t)]. \tag{6}$$

This is an immediate consequence of the fact that the fixed points of $U$ consist of the
two dimensional space spanned by the function 1 and $\phi_{\sigma,\alpha}$. Equation (6) shows that
two functions $g_1$ and $g_2$ which agree at 0 and 1 have the same limit. This enables us
to show:


THEOREM 9. If $g(t)$ is any bounded function continuous at 0 and 1, then $U^n g$
converges strongly.


PROOF. Let $g(t)$, in addition to being continuous at 0 and 1, possess finite
derivatives at 0 and 1. Then clearly there exist two continuous functions $h_1(t)$ and
$h_2(t)$ with

$$h_1(t) \geq g(t) \geq h_2(t),$$

where $h_1(0) = h_2(0)$ and $h_1(1) = h_2(1)$. We conclude the result from this using the
argument of Theorem 7 and equation (6). If now $g(t)$ is only continuous at 0 and 1,
then we can find for any $\epsilon$ a $g_\epsilon(t)$ satisfying the properties assumed about $g(t)$ in the
first part of the proof with $|g(t) - g_\epsilon(t)| \leq \epsilon$. As $\|U^n\| = 1$, the conclusion of the
theorem now follows by a standard argument.


THEOREM 10. If $|\pi^{(i)}(t)| \leq c_i$ for $0 \leq i \leq m$, then $|(U^n \pi)^{(i)}(t)| \leq K_i$ for
$0 \leq i \leq m$.


PROOF. The proof is by induction. For $r = 0$, the result is trivial since $U$
preserves positivity, and the constant functions are fixed points of $U$. Suppose we have
established the result for $r = m - 1$. We note that

$$(U\pi)^{(m)}(t) = (1 - t)\, \sigma^m \pi^{(m)}(\sigma t) + t\, (1 - \alpha)^m \pi^{(m)}[\alpha + (1 - \alpha) t] + m \left[ (1 - \alpha)^{m-1} \pi^{(m-1)}(\alpha + (1 - \alpha) t) - \sigma^{m-1} \pi^{(m-1)}(\sigma t) \right].$$

This easily yields that

$$\max_t |(U\pi)^{(m)}(t)| \leq \lambda \max_t |\pi^{(m)}(t)| + C \max_t |\pi^{(m-1)}(t)|,$$

where

$$\lambda = \max_t \left[ (1 - t)\, \sigma^m + t\, (1 - \alpha)^m \right] < 1.$$

Therefore,

$$\max_t |(U^k \pi)^{(m)}(t)| \leq \lambda \max_t |(U^{k-1} \pi)^{(m)}(t)| + C \max_t |(U^{k-1} \pi)^{(m-1)}(t)| \leq \lambda \max_t |(U^{k-1} \pi)^{(m)}(t)| + K$$

by our induction hypothesis. Iterating this last inequality yields

$$\max_t |(U^k \pi)^{(m)}(t)| \leq \sum_{i=0}^{k-1} \lambda^i K + \lambda^k \max_t |\pi^{(m)}(t)| \leq M.$$

This establishes the theorem.
THEOREM 11. If $g(t)$ belongs to $C^n$ ($n$ continuous derivatives), then

$$\lim_{m \to \infty} [U^m g(t)]^{(r)}$$

converges uniformly for $0 \leq r \leq n - 1$.
PROOF. We prove the theorem only for $r = 1$, for the other cases are similar.
On account of Theorem 10, the uniform boundedness of $(U^m g)^{(2)}$ implies the equicontinuity
of $(U^m g)^{(1)}$. Thus we can select a subsequence converging uniformly since
$(U^m g)^{(1)}$ are also uniformly bounded. Let

$$\psi(t) = \lim (U^{m_i} g)^{(1)}.$$

Since $\lim U^{m_i} g$ converges uniformly to a unique limit $\theta(t)$, we obtain $\theta'(t) = \psi(t)$.
As $\theta'(t)$ is independent of the subsequence chosen, the conclusion of the theorem easily
follows.
THEOREM 12. The fixed point $\phi_{\sigma,\alpha}$ is analytic for $0 < t < 1$ with $\phi_{\sigma,\alpha}^{(n)} \geq 0$.
PROOF. Let $p(t)$ denote a function infinitely differentiable with $p^{(n)}(t) \geq 0$ and
$p(0) = 0$, $p(1) = 1$. By virtue of Theorem 11 and Theorem 4 we deduce that

$$\lim_{m \to \infty} (U^m p)^{(n)} = \phi_{\sigma,\alpha}^{(n)} \geq 0.$$

Therefore $\phi_{\sigma,\alpha}$ is absolutely monotonic and hence, by a well-known theorem, is analytic.
At this point it seems desirable to summarize the analogous results of Theorems
2 through Theorem 12 for the case where $\alpha + \sigma > 1$. We enumerate the corresponding
theorems.
THEOREM 4'. If $(-1)^{i+1} \pi^{(i)}(t) \geq 0$ for $i = 1, 2, \ldots, n$, and $\pi(t) \geq 0$, then
$(-1)^{i+1} (U\pi)^{(i)}(t) \geq 0$.
In particular, positive increasing concave functions are transformed into functions
of the same kind.


THEOREM 5'. If $C \geq \pi(t) \geq 0$ and $C \geq (-1)^{i+1} \pi^{(i)}(t) \geq 0$ for $1 \leq i \leq n$,
then $0 \leq (-1)^{i+1} (U^k \pi)^{(i)}(t) \leq K_i$, and hence $|(U^k \pi)^{(i)}(t)| \leq K_i$ for $1 \leq i \leq n$.
Theorem 6 remains unchanged and is valid independent of the conditions on
$\alpha$ and $\sigma$, provided only they lie in the open unit interval.

Theorem 7 holds with a modification of the proof where $p(t)$ is replaced by
the concave function

$$p(t) = \begin{cases} t/t_0 & \text{for } 0 \leq t \leq t_0, \\ 1 & \text{for } t_0 \leq t \leq 1, \end{cases}$$

and the functions $t^r$ are replaced by $1 - (1 - t)^r$. These also constitute, with the
constant function, a family of functions whose linear span is dense in $C[0, 1]$. This
enables us to infer the validity of Theorem 8. Theorems 9, 10, and 11, with suitable
changes in their statements which we leave for the reader, are established by simple
appropriate modifications similar to that indicated above for Theorem 7. The unique
solution $\phi_{\sigma,\alpha}$ for this situation, where $\alpha + \sigma > 1$, is completely monotonic and hence
analytic. In the remainder of this section the theorems are established without any
specification as to the value of $\alpha + \sigma$.


THEOREM 13. The functions

$$h_m(t) = \sum_{n=m}^{\infty} U^n [t(1 - t)]$$

converge geometrically to 0.
PROOF. It is immediate from (6) that

$$U^n [t(1 - t)] = \psi_n(t)$$

tends uniformly to zero. Since the derivative at 0 and 1 of $t(1 - t)$ is 1 and $-1$, we
conclude by Theorem 11 that for $n$ sufficiently large there exists an $n_0(\lambda)$ such that

$$U^{n_0} [t(1 - t)] \leq \lambda\, t(1 - t)$$

with $\lambda < 1$. Let $k n_0$ denote the last integer $k$ for which $k n_0 \leq m$. We obtain


CY ! 


X Ul =H EC < Cptoik < Cp", 


0 < pt) < hinlt) < 


1-2 


where

$$\rho = \lambda^{1/n_0} < 1.$$


THEOREM 14. If $g(t)$ is continuous, $|g'(1)| \leq c_1$ and $|g'(0)| \leq c_2$, then $\lim U^n [g(t)]$
converges geometrically.


PROOF. We first establish the result for special functions $t^r$ with $1 \leq r < \infty$.
A simple calculation shows that


ICE! = 10-8 UO TP ECHL = 1). 
For n < m, we obtain upon continued application of U and summation that 
n n 
=€ 2. Ul = ME UG = Un) ESC 2. UC =): 
im 


The conclusion now follows from Theorem 13. The general function $g(t)$, satisfying
the hypothesis of Theorem 14, can be bounded from above and below by two polynomials
$P_1(t)$ and $P_2(t)$ which agree at 0 and 1. The result now follows directly from
this fact and the first part of this proof.
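We observe easily the identity

$$Ut - t = (\alpha + \sigma - 1)\, t(1 - t).$$

Applying successively $U$ and adding, we obtain

$$\phi_{\sigma,\alpha} = \lim_{n \to \infty} U^n t = t + (\alpha + \sigma - 1) \sum_{n=0}^{\infty} U^n [t(1 - t)]. \tag{7}$$

This is useful for purposes of calculation.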
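
One way such a calculation might be organized is sketched below in Python with NumPy. The sketch assumes $(U\pi)(t) = (1 - t)\pi(\sigma t) + t\pi[\alpha + (1 - \alpha)t]$ and the identity $Ut - t = (\alpha + \sigma - 1)t(1 - t)$. Functions are stored by their values on a uniform grid and evaluated between grid points by linear interpolation, which is an approximation introduced here and not part of the paper; the function names are likewise illustrative.

```python
import numpy as np


def apply_U_grid(values, grid, sigma, alpha):
    """Apply U to a function given by its values on grid (linear interpolation)."""
    left = np.interp(sigma * grid, grid, values)
    right = np.interp(alpha + (1.0 - alpha) * grid, grid, values)
    return (1.0 - grid) * left + grid * right


def fixed_point_by_iteration(grid, sigma, alpha, n_iter):
    """Approximate the fixed point as U^n t, starting from pi(t) = t."""
    values = grid.copy()
    for _ in range(n_iter):
        values = apply_U_grid(values, grid, sigma, alpha)
    return values


def fixed_point_by_series(grid, sigma, alpha, n_terms):
    """Approximate the fixed point as t + (alpha+sigma-1) * sum_n U^n[t(1-t)]."""
    term = grid * (1.0 - grid)
    total = np.zeros_like(grid)
    for _ in range(n_terms):
        total += term
        term = apply_U_grid(term, grid, sigma, alpha)
    return grid + (alpha + sigma - 1.0) * total
```

On a given grid the two routines should return the same values up to rounding when `n_iter` equals `n_terms`, since the partial sums of the series telescope to the iterates $U^n t$. When $\alpha + \sigma = 1$ both reduce to the identity function $t$.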
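Some remarks describing the dependence of $\phi_{\sigma,\alpha}$ on $\sigma$ and $\alpha$ are in order. We
consider the following identity:

$$U_{\sigma,\alpha}^n - U_{\sigma',\alpha'}^n = \sum_{i=0}^{n-1} U_{\sigma,\alpha}^i \left( U_{\sigma,\alpha} - U_{\sigma',\alpha'} \right) U_{\sigma',\alpha'}^{n-1-i}. \tag{8}$$

If $f(t)$ is any function with bounded derivatives, then we obtain by the mean-value
theorem that

$$|U_{\sigma,\alpha} f - U_{\sigma',\alpha'} f| \leq (1 - t)\, |f(\sigma t) - f(\sigma' t)| + t\, |f(\alpha + (1 - \alpha) t) - f(\alpha' + (1 - \alpha') t)| \leq C \left( |\sigma - \sigma'| + |\alpha - \alpha'| \right) t(1 - t).$$

Applying equation (8) to $f(t) = \phi_{\sigma',\alpha'}$, and remembering that inequalities are preserved
by Theorem 2, we obtain

$$|U_{\sigma,\alpha}^n \phi_{\sigma',\alpha'} - \phi_{\sigma',\alpha'}| \leq C \left( |\sigma - \sigma'| + |\alpha - \alpha'| \right) \sum_{i=0}^{n-1} U_{\sigma,\alpha}^i [t(1 - t)].$$

Allowing $n$ to go to $\infty$, we have easily that

$$|\phi_{\sigma,\alpha} - \phi_{\sigma',\alpha'}| \leq K(\eta) \left( |\sigma - \sigma'| + |\alpha - \alpha'| \right),$$

where $K(\eta)$ is finite, provided that $0 < \eta \leq \alpha, \sigma, \alpha', \sigma' \leq 1 - \eta < 1$.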

It is worthwhile to discuss the nature of $\phi_{\sigma,\alpha}$ for $(\sigma, \alpha)$ lying on the boundary
of the unit square. First, we observe by direct verification that when $\alpha + \sigma = 1$, then
$\phi_{\sigma,\alpha}(x) = x$. Next let $\alpha = 0$ and $\sigma < 1$; then

$$U\phi = (1 - x)\, \phi(\sigma x) + x\, \phi(x).$$

Therefore, if $\phi$ is a fixed point with $\phi(0) = 0$ and $\phi(1) = 1$, then for $x \neq 1$ we have that
$\phi(x) = \phi(\sigma x)$, and hence $\phi(x) = \phi(0) = 0$ ($0 \leq x < 1$) provided that $\phi$ is continuous
at 0. Similarly, when $\sigma = 1$ and $\alpha < 1$ then the only fixed point $\phi$ continuous at 1
and satisfying $\phi(0) = 0$, $\phi(1) = 1$, is $\phi(x) \equiv 1$ for $0 < x \leq 1$. On the other two
boundaries of the unit square the solutions are easily calculated and turn out as follows:
If $0 < \sigma < 1$ is arbitrary and $\alpha = 1$, then
where $L^0 = I$ and the operation $L$ applied to $x$ gives $\alpha + (1 - \alpha) x$. Finally for $\alpha = 0$,
$\sigma = 1$ the operator $U$ reduces to the identity mapping. We now investigate the
dependence of $\phi_{\sigma,\alpha}$ on $\sigma$ and $\alpha$ as we allow $\sigma$ and $\alpha$ to tend to the boundary. We limit
our attention for definiteness to studying the case where $(\sigma, \alpha) \to (\sigma_0, 0)$ with $\sigma_0 < 1$,
and we show that $\phi_{\sigma,\alpha}$ converges pointwise to 0 for $0 \leq x < 1$, and $\phi_{\sigma,\alpha}(1) = 1$ otherwise.
Moreover, the convergence is uniform in any interval $0 \leq x \leq 1 - \delta < 1$.
Let $(\sigma_n, \alpha_n) \to (\sigma_0, 0)$; then without loss of generality we may assume that $1 - \sigma_n - \alpha_n > 0$.
Therefore the $\phi_{\sigma_n,\alpha_n}$ are convex, monotonic increasing and positive, with
$\phi_{\sigma_n,\alpha_n}(0) = 0$. Also, for any interior interval $0 \leq x \leq 1 - \delta < 1$, the first derivatives
$\phi'_{\sigma_n,\alpha_n}$ are uniformly bounded. Since this implies the $\phi_{\sigma_n,\alpha_n}$ are equi-continuous over
the subinterval, and as $0 \leq \phi_{\sigma_n,\alpha_n} \leq 1$, we can select a subsequence which may be
denoted as $\phi_{\sigma_r,\alpha_r}$ converging to $\Gamma(x)$ uniformly, for any interval of the form
$0 \leq x \leq 1 - \delta < 1$. As

$$\phi_{\sigma_r,\alpha_r}(1) = 1,$$

we get $\Gamma(1) = 1$ and similarly $\Gamma(0) = 0$. The uniform convergence of $\phi_{\sigma_r,\alpha_r}$ guarantees
the continuity of $\Gamma$ at zero.
Put

$$U_r = U_{\sigma_r,\alpha_r}, \qquad U_0 = U_{\sigma_0,0}, \qquad \text{and} \qquad \phi_r = \phi_{\sigma_r,\alpha_r}.$$

We consider the following identity:

$$\Gamma - U_0 \Gamma = (\Gamma - \phi_r) + (\phi_r - U_r \Gamma) + (U_r \Gamma - U_0 \Gamma) = I_1 + I_2 + I_3.$$

We take a fixed $x < 1$; then trivially $|I_1| = |\Gamma - \phi_r| \leq \epsilon$ when $r$ is sufficiently
large. Also

$$|I_2| = |\phi_r - U_r \Gamma| = |U_r \phi_r - U_r \Gamma| = \left| (1 - x) [\phi_r(\sigma_r x) - \Gamma(\sigma_r x)] + x [\phi_r(\alpha_r + (1 - \alpha_r) x) - \Gamma(\alpha_r + (1 - \alpha_r) x)] \right|.$$

But for $x = x_0 < 1$ fixed, we observe that $\alpha_r + (1 - \alpha_r) x$ varies in an interval
$\leq 1 - \delta$ as $\alpha_r \to 0$, and the same applies to $\sigma_r x$. The uniform convergence of $\phi_r \to \Gamma$
inside $0 \leq x \leq 1 - \delta$ yields $|I_2| \leq \epsilon$. By construction, $|I_3| \leq \epsilon$ for $r$ large. Thus
we infer the equality $\Gamma = U_0 \Gamma$ for $0 \leq x < 1$, and by direct verification for $x = 1$.
However, the fixed point to the equation $U_0 \Gamma = \Gamma$ with $\Gamma(0) = 0$, $\Gamma(1) = 1$ and $\Gamma$
continuous at 0 is $\Gamma(x) = 0$ for $0 \leq x < 1$ and $\Gamma(1) = 1$. Thus the limit function $\Gamma$
is the same for every subsequence of $\phi_{\sigma_n,\alpha_n}$, and hence we deduce that $\phi_{\sigma,\alpha}$ converges
pointwise. We furthermore note that $\Gamma$ is independent of $\sigma_0 < 1$. A similar analysis
applies to the case where $(\sigma, \alpha) \to (1, \alpha_0)$ ($\alpha_0 > 0$). The continuity properties of the
solution for the other two boundaries yield to simpler analysis. Summarizing, we have
established the following theorem:


THEOREM 15. The fixed points $\phi_{\sigma,\alpha}$ satisfy the following continuity properties:
HfO<H <a, <land0<o,o <], then jg l 


[6a = $x < KO)lle — | +l]. 


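If $(\sigma, \alpha) \to (\sigma_0, 0)$ with $\sigma_0 < 1$, then $\phi_{\sigma,\alpha}(t) \to 0$ pointwise for $0 \le t < 1$ and $\phi_{\sigma,\alpha}(1) = 1$.
If $(\sigma, \alpha) \to (1, \alpha_0)$ with $\alpha_0 > 0$, then $\phi_{\sigma,\alpha}(t) \to 1$ pointwise for $0 < t \le 1$.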


Finally, a word concerning convergence of $U^n\pi$ for $\pi$ continuous when the parameter values lie on the boundary. When $\alpha = 0$, $\sigma < 1$, then $U^n\pi$ converges pointwise. The same conclusion holds when $\alpha > 0$ and $\sigma = 1$. On the other two boundaries the convergence is uniform for $U^n\pi$. We omit the proofs.

We now return to the study of the operator T. 


THEOREM 16. For any distribution $F$ the iterates $T^nF$ converge in the sense of distributions to the distribution
$$G(t) = I_1(t)\int_0^1 \phi_{\sigma,\alpha}\,dF + I_0(t)\int_0^1 (1 - \phi_{\sigma,\alpha})\,dF,$$
where $I_0(t)$ and $I_1(t)$ are the distributions concentrating fully at 0 and 1 respectively.


PROOF. From the convergence of $U^n\pi$ for any continuous function $\pi$ and Theorem 1 follows the weak* convergence of $T^nF$. This is equivalent to the convergence of $T^nF$ in the sense of distributions. The actual form of
$$\lim_{n \to \infty} T^nF = G$$
as given in the theorem follows directly from (6).


By choosing the distribution $F = I_{t_0}$, we obtain from Theorem 16 that $\phi_{\sigma,\alpha}(t_0)$ represents the probability with which the limiting distribution concentrates at 1, or in other words, as can be easily shown, the probability with which the particle beginning at $t_0$ will converge to 1. This furnishes a probability interpretation to the fixed point of the operator $U$ which is different from a constant.
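The probability interpretation of the nonconstant fixed point can be illustrated numerically. The following is a minimal sketch, assuming the model of § 1 moves the particle from $x$ to $\alpha + (1 - \alpha)x$ with probability $x$ and to $\sigma x$ with probability $1 - x$ (an assumption read off from the surrounding text). The function name `estimate_absorption_at_one` and its parameters are hypothetical. Because the limiting distribution concentrates at 0 and 1, the average position after many steps approximates the probability of converging to 1.

```python
import random

def estimate_absorption_at_one(t0, sigma, alpha, steps=200, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that the particle converges to 1.

    Assumed dynamics (model of section 1, as read from the text):
      x -> alpha + (1 - alpha) * x  with probability x
      x -> sigma * x                with probability 1 - x
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = t0
        for _ in range(steps):
            if rng.random() < x:
                x = alpha + (1.0 - alpha) * x  # impulse toward 1
            else:
                x = sigma * x                  # impulse toward 0
        total += x  # after many steps x is close to 0 or close to 1
    return total / trials
```

For instance, `estimate_absorption_at_one(0.5, 0.5, 0.5)` returns a value strictly between the boundary values 0 (obtained for `t0 = 0`) and 1 (obtained for `t0 = 1`).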
In connection with Theorem 8, we remark that $U^n\pi$ cannot converge for an arbitrary Lebesgue measurable bounded function. In fact, if we assume that $U^n\pi$ converges for every bounded measurable function $\pi(t)$, then $T^nF$ would converge weakly if $F$ were absolutely continuous. Since the space of all integrable functions $L[0, 1]$ is weakly complete, and $T$ maps distributions into distributions, we could find a fixed point $TF = F$ with $F$ absolutely continuous and total variation 1. However, in view of (16) the only fixed distributions which exist concentrate only at 0 and 1, and hence cannot be absolutely continuous.
Finally, we present a slight application of Theorem 14. We show that the expected position of the particle converges geometrically for any starting distribution, although the iterated distributions converge slowly to the limiting distribution. The expected position of the particle is given by
$$\int_0^1 x\,dF(x) = (x, F),$$
where $F$ is the cumulative distribution describing the position. The expected position at the $n$th step is given by
$$(x, T^nF) = (U^n x, F).$$
On account of Theorem 14, $U^n x$ converges geometrically, which establishes the assertion. The same conclusion applies to all the moments. This observation is very useful for computational and estimation purposes.
Finally, we note that the spectrum of the operator $T$ cannot consist of the isolated point 1. Otherwise by standard techniques one can show that $U^n\pi$ converges for any measurable bounded function $\pi$.
2. In this second model the random walk is described as follows: If the particle is at $x$, then $x \to \alpha + (1 - \alpha)x$ with probability $\phi(x)$ and $x \to \sigma x$ with probability $1 - \phi(x)$, where

I$) —- HOM SH<I. 


The analogous transition operator to (1) becomes
$$G(x) = TF = \int_0^{x/\sigma} (1 - \phi(t))\,dF(t) + \int_0^{(x - \alpha)/(1 - \alpha)} \phi(t)\,dF(t), \tag{9}$$
with the same understanding concerning $F$ applying as before. Let
$$U\pi = [1 - \phi(t)]\pi(\sigma t) + \phi(t)\pi(\alpha + (1 - \alpha)t). \tag{10}$$
In this section, we take $0 < \alpha, \sigma < 1$; the case where boundary values for $\alpha$ and $\sigma$ are considered is easy to handle but not of great interest. The spaces on which they operate are the same as in § 1. Again, in a similar manner to Theorem 1, we obtain:


THEOREM 17. The operator T is conjugate to the operator U. 


We now further assume that $\phi(t)$ is monotonic increasing. This model includes the important case where $\phi(t) = \lambda + \mu t$, where $\lambda + \mu \le 1$; and whenever $\lambda + \mu = 1$ then $\lambda > 0$.


THEOREM 18. The operator U preserves positivity and positive monotonic in- 
creasing functions. 


PROOF. Direct verification. 


Since the hypothesis on $\phi(t)$ implies either $\phi(1) < 1$ or $\phi(0) > 0$, we analyze the case where $\phi(1) < 1$. The other circumstance can be treated in an analogous manner. Furthermore, we now assume that if $\phi(0) = 0$, then $\phi'(0)$ exists and is finite.


THEOREM 19. If $\pi(t)$ is monotonic increasing, bounded and positive, then $U^n\pi$ converges uniformly to a constant.


The proof can be carried out easily using the techniques employed above.
The hypothesis on $\phi(t)$ easily yields the fact that the only continuous fixed points of $U\pi = \pi$ are constant functions. The proof is similar to the proof used in Theorem 6. This fact directly connects with the result of Theorem 21 below. First, we complete the proof of convergence of $U^n\pi$ for any continuous function $\pi(t)$.


THEOREM 20. The operators $U^n\pi$ converge uniformly for any continuous function.


PROOF. Since $\|U^n\| = 1$, and the space of all monotonic positive continuous functions spans a dense subset of the set of all continuous functions, the theorem follows by a well-known theorem of Banach.


THEOREM 21. For any distribution $F$, the distributions $T^nF$ converge as distributions to a unique distribution $G$ for which $TG = G$ which is independent of $F$.


PROOF. The weak* convergence of $T^nF$ follows directly from Theorem 20 and Theorem 16. To complete the proof we must establish that if $\lim T^nF = G$ and $\lim T^nH = K$, then $G = K$. Indeed, let $\pi$ denote any continuous function, and let $a$ denote the constant limit of $U^n\pi$. We have that
$$(\pi, G - K) = \lim_{n \to \infty} (\pi, T^nF - T^nH) = \lim_{n \to \infty} (U^n\pi, F - H) = a \int_0^1 d[F - H] = 0$$
as $F$ and $H$ are distributions. Hence
$$\int \pi(t)\,dG(t) = \int \pi(t)\,dK(t)$$
for any continuous function $\pi$, and therefore $G = K$.

It seems extremely difficult to determine the complete nature of this unique fixed distribution. We shall say more about it in a later section. We denote it by $F_{\sigma,\alpha}$.
THEOREM 22. The distribution $F_{\sigma,\alpha}$ is a continuous function of $\sigma, \alpha$; that is, if $(\sigma_n, \alpha_n) \to (\sigma, \alpha)$ with $0 < \sigma, \alpha < 1$, then $F_{\sigma_n,\alpha_n} \to F_{\sigma,\alpha}$ at every point of continuity of $F_{\sigma,\alpha}$.

PROOF. Let $(\sigma_n, \alpha_n) \to (\sigma, \alpha)$; by Helly's theorem we can choose a subsequence $F_r = F_{\sigma_r,\alpha_r}$ converging to the distribution $F$ at every continuity point. Write $T_r$ for $T_{\sigma_r,\alpha_r}$ and $T$ for $T_{\sigma,\alpha}$. Let $\pi(t)$ denote any fixed continuous function. We consider the quantity
$$(\pi, F - TF) = (\pi, F - F_r) + (\pi, T_rF_r - TF_r) + (\pi, TF_r - TF).$$
Since $F_r \to F$ as distributions, we find for $r$ sufficiently large that $|(\pi, F - F_r)| \le \epsilon$. Now we note that
$$|(\pi, T_rF_r - TF_r)| = |(U_r\pi, F_r) - (U\pi, F_r)| = |(U_r\pi - U\pi, F_r)|.$$
Since $U_r = U_{\sigma_r,\alpha_r}$ converges strongly to $U = U_{\sigma,\alpha}$, as is trivial to verify, it follows that $U_r\pi$ converges uniformly to $U\pi$. Whence, as the $F_r$ are distributions, we infer that
$$|(U_r\pi - U\pi, F_r)| \le \max_t |U_r\pi - U\pi| \le \epsilon$$
when $r$ is chosen large enough. Evidently, with $r$ large we get as before that
$$|(\pi, TF_r - TF)| = |(U\pi, F_r - F)| \le \epsilon.$$
Therefore we obtain for $r$ large that $|(\pi, F - TF)| \le 3\epsilon$, and hence $(\pi, F) = (\pi, TF)$. Since $\pi$ is any continuous function, we infer $F = TF$ and therefore $F = F_{\sigma,\alpha}$ by Theorem 21. Consequently, as any limit distribution of $F_{\sigma_n,\alpha_n}$ must be $F_{\sigma,\alpha}$, the conclusion of Theorem 22 is now immediate.
3. The model considered in this section is with $\phi(x) = 1 - x$. In this case $\phi$ is monotonic decreasing. The operator $U$ becomes
$$U\pi(t) = t\pi(\sigma t) + (1 - t)\pi(1 - \alpha + \alpha t). \tag{12}$$
Note that we have replaced $\alpha$ by $1 - \alpha$. This is only for convenience in Theorem 28, and does not restrict any generality. In this model the closer the particle moves to the ends 0 and 1 the greater probability there is of moving back into the interior. The situation described here is of completely reflecting boundaries. Again it is easy to show that the only continuous fixed points of $U\pi = \pi$ are the constant functions. Therefore, we shall find as in § 2 that the distributions describing the position of the particle converge to a limit distribution independent of the initial distribution. We first proceed to analyze convergence properties of $U^n\pi$. In this case it is no longer true that $U$ preserves the class of positive monotonic functions. Only positivity is conserved by the mapping $U$. However, a new quality as described in Theorem 23 serves here well.

Throughout this section, in order to avoid trivial changes of proof and different results at times, we suppose that $0 < \alpha, \sigma < 1$.


THEOREM 23. If $\pi(t)$ has a continuous derivative, then
$$\max_t |(U\pi)'(t)| \le \max_t |\pi'(t)|,$$
with equality holding if and only if $\pi(t)$ is linear.


PROOF. By direct computation, we obtain
$$(U\pi)'(t) = t\sigma\pi'(\sigma t) + (1 - t)\alpha\pi'(1 - \alpha + \alpha t) + \pi(\sigma t) - \pi(1 - \alpha + \alpha t).$$
Hence, with the aid of the mean-value theorem we get
$$\max_t |(U\pi)'(t)| \le \max_t \left| t\sigma\pi'(\sigma t) + (1 - t)\alpha\pi'(1 - \alpha + \alpha t) + \frac{\pi(\sigma t) - \pi(1 - \alpha + \alpha t)}{\sigma t - (1 - \alpha) - \alpha t}\,(\sigma t - (1 - \alpha) - \alpha t) \right| \le \max_t\,[t\sigma + (1 - t)\alpha + 1 - \alpha - (\sigma - \alpha)t]\,\max_t |\pi'(t)| = \max_t |\pi'(t)|. \tag{13}$$
If equality holds, then let $t_0$ denote a point where
$$\max_t |(U\pi)'(t)| = |(U\pi)'(t_0)|.$$
It follows easily from (13) that
$$\max_t |\pi'(t)| = |\pi'(\sigma t_0)| = |\pi'(1 - \alpha + \alpha t_0)| = \left| \frac{\pi(\sigma t_0) - \pi(1 - \alpha + \alpha t_0)}{\sigma t_0 - (1 - \alpha) - \alpha t_0} \right|. \tag{14}$$
This yields that $\pi(t)$ is linear for $\sigma t_0 \le t \le 1 - \alpha + \alpha t_0$, or otherwise somewhere between $\sigma t_0$ and $1 - \alpha + \alpha t_0$ the slope has greater magnitude than the slope of the chord subtended by $\pi(t)$ at these points. Equation (14) shows also that $\sigma t_0$ and $1 - \alpha + \alpha t_0$ are maximum points of $|\pi'(t)|$. Repeating this argument successively then implies that equality in (13) requires $\pi(t)$ to be linear.


THEOREM 24. If $\pi(t)$ belongs to $C^m$ [$\pi(t)$ possesses $m$ continuous derivatives], then $\max_t |(U^n\pi)^{(r)}(t)|$ is uniformly bounded in $n$ for each $r$ $(0 \le r \le m)$.


PROOF. The proof is similar to that of Theorem 10.


THEOREM 25. If $\pi(t)$ possesses two continuous derivatives, and $\sigma \ne \alpha$, then $U^n\pi$ converges uniformly to a constant.


Remark. The reason why the two cases $\sigma = \alpha$ and $\sigma \ne \alpha$ are distinguished, and necessarily so, will be explained later.
PROOF. In view of Theorem 23 and Theorem 24, the first and second derivatives of $U^n\pi$ are uniformly bounded. Thus $U^n\pi$ and $(U^n\pi)'$ constitute equicontinuous families of functions. We can thus select a subsequence $n_i$ such that $U^{n_i}\pi$ converges uniformly to $\phi(t)$, and $(U^{n_i}\pi)'$ converges uniformly to $\phi'(t)$. It follows trivially that $U^{n_i+1}\pi$ tends uniformly to $U\phi$ and
$$U^{n_i+2}\pi \to U^2\phi.$$
Moreover, by virtue of Theorem 23,
$$\max_t |(U^n\pi)'| \ge \max_t |(U^{n+1}\pi)'| \ge \max_t |(U^{n+2}\pi)'|. \tag{15}$$
Hence
$$\lim_{i \to \infty} \max_t |(U^{n_i}\pi)'| = \lim_{i \to \infty} \max_t |(U^{n_i+1}\pi)'| = \lim_{i \to \infty} \max_t |(U^{n_i+2}\pi)'|.$$
Therefore, by the uniform convergence of the derivatives, we secure
$$\max_t |\phi'(t)| = \max_t |(U\phi)'(t)| = \max_t |(U^2\phi)'(t)|.$$
Invoking Theorem 23 yields that $\phi(t)$ and $U\phi(t)$ are linear. However, if $\alpha \ne \sigma$ and $\phi(t)$ contains a term with $t$, then $U\phi$ is quadratic. This impossibility forces $\phi(t)$ to be identically a constant $c$. Let $i$ be chosen sufficiently large so that
$$|U^{n_i}\pi - c| \le \epsilon.$$
Then
$$|U^{n_i+1}\pi - c| \le t\,|U^{n_i}\pi(\sigma t) - c| + (1 - t)\,|U^{n_i}\pi(1 - \alpha + \alpha t) - c| \le \epsilon.$$
Repeating this argument shows that
$$|U^{n_i+p}\pi - c| \le \epsilon$$
for any $p$. This establishes that $U^n\pi$ converges uniformly to $c$.

THEOREM 26. If $\pi(t)$ is continuous and $\sigma \ne \alpha$, then $U^n\pi$ converges uniformly.

PROOF. The space of all functions with two continuous derivatives spans linearly a dense subset of the space of all continuous functions. Since $\|U^n\| = 1$, we obtain the result using Theorem 25 and a well-known theorem of Banach.

In the next two theorems we establish the uniform convergence of $U^n\pi$ for the case where $1 > \sigma = \alpha > 0$. We note in this case the interesting fact that $U$ applied to a polynomial does not increase its degree. Particularly,
$$U t^n = [\alpha^n - n\alpha^{n-1}(1 - \alpha)]\,t^n + P_{n-1}(t),$$
where $P_{n-1}(t)$ denotes a polynomial of degree $n - 1$.
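The degree-preserving property for $\sigma = \alpha$ can be checked with exact polynomial arithmetic. The following is a minimal sketch, assuming the operator of equation (12), $(U\pi)(t) = t\pi(\sigma t) + (1 - t)\pi(1 - \alpha + \alpha t)$; the helper names (`poly_mul`, `poly_add`, `poly_compose_affine`, `apply_U`) are hypothetical. Polynomials are stored as coefficient lists, lowest degree first.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_add(p, q):
    """Add two polynomials of possibly different lengths."""
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0.0) + (q[i] if i < len(q) else 0.0)
            for i in range(n)]

def poly_compose_affine(p, c0, c1):
    """Coefficients of p(c0 + c1 * t), built with Horner's scheme."""
    result = [0.0]
    for coeff in reversed(p):
        result = poly_add(poly_mul(result, [c0, c1]), [coeff])
    return result

def apply_U(p, sigma, alpha):
    """(U p)(t) = t * p(sigma * t) + (1 - t) * p(1 - alpha + alpha * t)."""
    first = poly_mul([0.0, 1.0], poly_compose_affine(p, 0.0, sigma))
    second = poly_mul([1.0, -1.0], poly_compose_affine(p, 1.0 - alpha, alpha))
    return poly_add(first, second)
```

Applying `apply_U` to $t^3$ with `sigma = alpha = 0.6` gives a vanishing $t^4$ coefficient and a $t^3$ coefficient equal to $\alpha^3 - 3\alpha^2(1 - \alpha)$, up to rounding. With `sigma != alpha` the $t^4$ coefficient is $\sigma^3 - \alpha^3$, so the degree rises.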


THEOREM 27. If $P(t)$ is any polynomial, then $U^kP$ converges uniformly to a constant and the convergence is geometric.

PROOF. The proof is by induction on the degree of the polynomial. Clearly if $P$ is a constant $c$ then $U^kP \equiv c$. Suppose we have shown for any polynomial $P_{n-1}$ of degree $\le n - 1$ that the iterates $U^kP_{n-1}$ converge uniformly. To complete the proof, it is enough to verify that $U^k t^n$ converges uniformly. Let
$$\lambda = \alpha^n - n\alpha^{n-1}(1 - \alpha);$$
then $|\lambda| < 1$ since $1 > \alpha > 0$. We obtain
$$U t^n = \lambda t^n + P_{n-1}(t).$$
Repeating, we get, for $k \ge 1$,
$$U^k t^n = \lambda^k t^n + \sum_{r=0}^{k-1} \lambda^r U^{k-r-1} P_{n-1}.$$
This last sum is of the form
$$c_k(t) = \sum_{r} a_r\, b_{k-r}(t)$$
with $\sum |a_r| < \infty$, and $\lim b_k(t)$ exists. It is a well-known theorem that $\lim_{k \to \infty} c_k(t)$ exists uniformly whenever
$$b_k(t) = U^{k-1} P_{n-1}$$
converges uniformly. Thus, $U^k t^n$ converges uniformly to a fixed point which must be a constant function. Finally we note that in the case where $\sigma = \alpha$ (the rate of learning, so to speak, is the same regardless of the outcome of the experiment), then $U^kP$ for any polynomial converges geometrically. The proof can be carried through by using induction.

This yields the fact that the expected position converges geometrically to a limiting expected position with similar results valid for higher moments.


THEOREM 28. If $\pi(t)$ is continuous and $\sigma = \alpha > 0$, then $U^n\pi$ converges uniformly.
PROOF. Similar to Theorem 26, since the set of all polynomials is dense.
We now note the important example that when $\alpha = \sigma = 0$ it is no longer true that $U^n\pi$ converges. It is easily verified that in this case $U^{2n}\pi$ and $U^{2n+1}\pi$ converge separately but that a periodic phenomenon occurs otherwise. The argument of Theorem 27 breaks down in this case as the quantity $\lambda$ is $-1$. We only mention that other difficult convergence behavior occurs when $\alpha, \sigma$ traverse the boundary of the unit square for this model. In particular, when $\alpha = 1$ and $\sigma < 1$ it is not hard to show that $U^n\pi$ does not necessarily converge for every continuous function $\pi$, and even for the circumstance where $\pi$ is a polynomial. The case where $\sigma = \alpha = 1$ produces for $U$ the identity, for which the convergence of $U^n$ is trivial. For $\alpha < 1$, $\sigma = 1$ we can conclude again a lack of convergence. However, when $\alpha = 0$ and $1 > \sigma > 0$, or $\sigma = 0$ and $1 > \alpha > 0$, then $U^n\pi$ converges for every continuous function $\pi$.
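The periodic phenomenon at $\alpha = \sigma = 0$ is easy to see numerically. The following is a minimal sketch, assuming the operator of equation (12), $(U\pi)(t) = t\pi(\sigma t) + (1 - t)\pi(1 - \alpha + \alpha t)$, and evaluating $U^n\pi$ by direct recursion; the function name `iterate_U` is hypothetical. The recursion visits $2^n$ points, so it is only meant for small $n$.

```python
def iterate_U(pi, sigma, alpha, n):
    """Return the function U^n pi for (U pi)(t) = t*pi(sigma*t) + (1-t)*pi(1-alpha+alpha*t)."""
    def u_n(t, k=n):
        if k == 0:
            return pi(t)
        return (t * u_n(sigma * t, k - 1)
                + (1.0 - t) * u_n(1.0 - alpha + alpha * t, k - 1))
    return u_n

# With alpha = sigma = 0 and pi(t) = t, the iterates alternate between 1 - t and t.
values = [iterate_U(lambda t: t, 0.0, 0.0, n)(0.25) for n in range(1, 5)]
```

Here `values` alternates between 0.75 and 0.25, so the odd and even iterates converge separately while the full sequence does not. For comparison, with `sigma = alpha = 0.5` the same starting function is mapped to the constant 0.5 after a single step.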


We return now to the hypothesis $0 < \alpha, \sigma < 1$.


THEOREM 29. 


If (1) belongs to C™, the [* 
Ig s then (LU 


PEE AU 7 
n)(1) converges uniformly Jf 


PROOF. This follows easily from Theorems 24, 26, and 28. Let
$$TF(x) = \int_0^{x/\sigma} t\,dF(t) + \int_0^{(x - 1 + \alpha)/\alpha} (1 - t)\,dF(t).$$
This represents the transition law for the distribution describing the position of the particle for this model. By arguments analogous to those employed in the preceding sections, we can establish the following theorems, using the conjugate relationship between $T$ and $U$.


THEOREM 30. For any distribution $F$ the distributions $T^nF$ converge as distributions to a unique distribution $F_{\sigma,\alpha}$ for which $TF_{\sigma,\alpha} = F_{\sigma,\alpha}$ which is independent of $F$.


THEOREM 31. The distributions $F_{\sigma,\alpha}$ constitute a continuous family of distributions in the sense of Theorem 22.

Again it seems very difficult to determine any more explicit information about $F_{\sigma,\alpha}$.

4. The model examined here is such that $1 - \phi(x) = \lambda x + \mu$ with $\lambda + \mu \le 1$ and at least $1 > \lambda$ or $0 < \mu$. The operator $U$ has the form
$$U\pi = (\lambda t + \mu)\pi(\sigma t) + (1 - \lambda t - \mu)\pi(1 - \alpha + \alpha t). \tag{16}$$
Of course, as before, $0 < \alpha, \sigma < 1$. Convergence questions for $U^n\pi$ turn out to be very elementary in this case in view of the following theorem, which is easily proven.

THEOREM 32. If $\pi(x)$ has a bounded derivative, then
$$\max_t |(U\pi)'(t)| \le a \max_t |\pi'(t)|$$
with $a < 1$.


An immediate consequence of Theorem 32 is that $(U^n\pi)'$ converges geometrically to 0. Let $T$ denote the transition operator of distributions for this model. In the standard way, we obtain:


THEOREM 33. For any distribution $F$ the distributions $T^nF$ converge to the distribution $F_{\sigma,\alpha}$ which is a continuous function of $(\sigma, \alpha)$, and $TF_{\sigma,\alpha} = F_{\sigma,\alpha}$. Moreover, $F_{\sigma,\alpha}$ is independent of $F$.


5. This section is devoted to some variations of the preceding models. A new feature added first is that we allow, in addition to the two impulses of motion towards the two fixed points 0 and 1 by the transformations
$$F_1 x = \sigma x \quad \text{and} \quad F_2 x = 1 - \alpha + \alpha x,$$
the possibility of a third motion where the particle stands still with certain probability. These models are particularly important in learning problems, and much statistical investigation on this type has been done by M. M. Flood [5]. They are referred to as the pure models. The mathematical description of the first model of this type is as follows: A particle $x$ on the unit interval is subject to three random impulses: (1) $x \to \sigma x$ with probability $\pi_1(1 - x)$; (2) $x \to 1 - \alpha + \alpha x$ with probability $\pi_2 x$; and (3) $x \to x$ with probability $(1 - \pi_1)(1 - x) + (1 - \pi_2)x$, where $0 < \pi_1, \pi_2 \le 1$. This is similar to model 1 where absorption takes place at the boundaries 0 and 1. The operator analogous to (2) becomes
$$U\pi = \pi_1(1 - x)\pi(\sigma x) + [(1 - \pi_1)(1 - x) + (1 - \pi_2)x]\,\pi(x) + \pi_2 x\,\pi(1 - \alpha + \alpha x). \tag{17}$$



Again, let $T$ denote the transition operator which maps the distribution locating the particle into the corresponding distribution at the end of the experiment. Theorem 1 is valid for this setup, and $T$ is consequently conjugate to $U$. It is easy to verify that $U$ fulfills the conditions of Theorems 2 and 3 and also preserves the property of monotone increasing functions. Furthermore, we obtain:


THEOREM 34. If $\pi$, $\pi'$ and $\pi'' \ge 0$, then $(U\pi)'' \ge 0$ if and only if
$$\pi_1(1 - \sigma) + \pi_2(\alpha - 1) \ge 0;$$
and otherwise $U$ preserves with $\pi$ and $\pi' \ge 0$ the property of concavity.
PROOF. The proof can be carried through by direct computation.


We remark that the remainder of the analogue to Theorem 4 does not carry over under the condition stated in Theorem 34. Moreover, noting that we have here changed $\alpha$ into $1 - \alpha$ as compared to § 2, we obtain for $\pi_1 = \pi_2 = 1$ the condition of § 1 for preservation of convexity, and so on.

The analogues of Theorems 5, 6, 7, and 8 easily extend to this model by the same methods, and we obtain that $U^n\pi$ converges uniformly to a limit given by
$$[1 - \phi_{\sigma,\alpha,\pi_1,\pi_2}(t)]\,\pi(0) + \phi_{\sigma,\alpha,\pi_1,\pi_2}(t)\,\pi(1), \tag{18}$$
where $\phi_{\sigma,\alpha,\pi_1,\pi_2}$ is the unique continuous fixed point of $U\phi = \phi$ with $\phi(0) = 0$ and $\phi(1) = 1$. The entire theory of geometric convergence, continuity of $\phi$ as a function of $\sigma$, $\alpha$, $\pi_1$, and $\pi_2$, and the form of the limiting distribution of the particle established for the model of § 1 remains valid with slight changes in the proofs. The general conclusion is that introducing a probability of standing still has no effect on the convergence of the distributions or its limiting form provided only the essential feature of absorbing boundaries still prevails. Finally, in this connection we remark that for special boundary values of the parameters $\pi_1$ and $\pi_2$ the motion may become a drift to one or other of the end points: for example, $\pi_1 = 0$, $\pi_2 > 0$.


6. We treat in this section the following general nonlinear one-dimensional learning model. The particle moves with probability $\phi(x)$ from $x$ to $1 - \alpha + \alpha x$ and with probability $1 - \phi(x)$ from $x$ to $\sigma x$. The function $\phi$ is only continuous with the additional important requirement for this case that $\phi(x) \ge \delta > 0$ and $1 - \phi(x) \ge \delta > 0$ for all $x$ in the unit interval. This excludes the types of models discussed in §§ 1 and 3, but includes some subcases of the examples investigated in §§ 2 and 4. However, in those cases we obtained much stronger results about the rate of convergence of derivatives, and so on. The transition operators become
$$TF(x) = \int_0^{x/\sigma} [1 - \phi(t)]\,dF(t) + \int_0^{(x - 1 + \alpha)/\alpha} \phi(t)\,dF(t), \tag{20}$$
and $T$ is adjoint to
$$(U\pi)(t) = (1 - \phi(t))\,\pi(\sigma t) + \phi(t)\,\pi(1 - \alpha + \alpha t). \tag{21}$$
We shall show that $U^n\pi$ converges uniformly for any continuous function $\pi(t)$. The proof of this fact shall be based on the following highly intuitive proposition. Let an experiment be repeated with only two possible outcomes, success or failure at each trial. Suppose further that the probability of success $p_n$ at the $n$th trial depends on the outcome of the previous trial, but that these conditional probabilities satisfy $p_n \ge \eta > 0$; that is, regardless of the previous number of failures the conditional probability of success is always at least $\eta > 0$. Then the recurrent event of a success run of length $r$ with $r$ fixed is a certain event; that is, with probability 1 it will occur in finite time. This result can be deduced in a standard way using the theory of recurrent events [4].

We turn back now to the examination of $U^n\pi$. Let
$$F_1 x = \sigma x \quad \text{and} \quad F_2 x = 1 - \alpha + \alpha x,$$
and by $F$ denote the operation that either $F_1$ or $F_2$ is applied. We note the important obvious fact that
$$|F^r x - F^r y| \le \lambda^r |x - y|, \tag{22}$$
with $0 < \lambda < 1$, where $F^r$ denotes $r$ applications of $F_1$ and $F_2$ in some order acting on $x$ and $y$ in the same way.
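Inequality (22) can be spot-checked on random compositions. The following is a minimal sketch, assuming $\lambda = \max(\sigma, \alpha)$ (the text only states $0 < \lambda < 1$; this particular choice is an assumption, justified by each impulse being affine with slope $\sigma$ or $\alpha$). The names `apply_word` and `contraction_holds` are hypothetical.

```python
import random

def apply_word(word, x, sigma, alpha):
    """Apply a sequence of impulses to x: 1 means F1 x = sigma*x, 2 means F2 x = 1 - alpha + alpha*x."""
    for symbol in word:
        x = sigma * x if symbol == 1 else 1.0 - alpha + alpha * x
    return x

def contraction_holds(sigma, alpha, r, samples=1000, seed=0):
    """Check |F^r x - F^r y| <= lam**r * |x - y| on random words and random points."""
    rng = random.Random(seed)
    lam = max(sigma, alpha)  # assumed contraction constant
    for _ in range(samples):
        word = [rng.choice((1, 2)) for _ in range(r)]
        x, y = rng.random(), rng.random()
        gap = abs(apply_word(word, x, sigma, alpha) - apply_word(word, y, sigma, alpha))
        if gap > lam ** r * abs(x - y) + 1e-12:  # small slack for rounding
            return False
    return True
```

The same word is applied to both points, which mirrors the "success" event used later in the proof: whenever both particles receive the same impulse, their distance shrinks by a factor of at most $\lambda$.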
Next, we need the important lemma: 


LEMMA. If I$] SK for m= {0 TEE 
[Unatm(r)| < Ks uniformly in n and t. 


and |=O(0)| < Kj, then 


PROOF. The proof is similar to that of Theorem 24.


Now let $\pi(t)$ denote a continuously differentiable function. Consider the following identity:
$$\begin{aligned}
U^n\pi(x) - U^n\pi(y) = {} & (1 - \phi(x))(1 - \phi(y))\,[U^{n-1}\pi(F_1 x) - U^{n-1}\pi(F_1 y)] \\
& + \phi(x)\phi(y)\,[U^{n-1}\pi(F_2 x) - U^{n-1}\pi(F_2 y)] \\
& + (1 - \phi(x))\phi(y)\,[U^{n-1}\pi(F_1 x) - U^{n-1}\pi(F_2 y)] \\
& + \phi(x)(1 - \phi(y))\,[U^{n-1}\pi(F_2 x) - U^{n-1}\pi(F_1 y)].
\end{aligned} \tag{23}$$
We continue to apply this identity to the factors $U^{n-1}\pi(\cdot) - U^{n-1}\pi(\cdot)$; and when any term of the form $U^{n_i}\pi(F^r x) - U^{n_i}\pi(F^r y)$ is achieved, then that factor is allowed to stand without any further reduction. All other terms are reduced to expressions involving as factors $\pi(\cdot) - \pi(\cdot)$. Thus we obtain
$$U^n\pi(x) - U^n\pi(y) = I_1 + I_2,$$
where $I_2$ consists of terms of the form
$$\sum_i p_i\,[U^{n_i}\pi(F^r x) - U^{n_i}\pi(F^r y)],$$
and $\sum p_i \le 1$, while $I_1$ consists of the remaining terms. We now conceive of the following probability model. Let two particles undergo the random walk described by this model starting from $x$ and $y$, respectively. We say a success occurs if the same impulse activates both particles, and otherwise failure occurs. The probability of success is given initially by
$$\phi(x)\phi(y) + [1 - \phi(x)][1 - \phi(y)] \ge 2\delta^2 > 0;$$
and it is easily seen that each $p_k$, where $p_k$ is the conditional probability of success occurring on the $k$th trial, satisfies
$$p_k \ge 2\delta^2 > 0.$$
Consequently, a success run of length $r$ is certain to happen in finite time. In particular, as $n \to \infty$, $I_1 \to 0$, since $I_1$ is bounded by twice the probability of no success run in $n$ trials times $K$. On the other hand, in view of the lemma and equation (22) we secure that $|I_2| \le C\lambda^r$. Therefore,
$$\limsup_{n \to \infty} |U^n\pi(x) - U^n\pi(y)| \le C\lambda^r,$$
which can be made arbitrarily small as $r \to \infty$. Hence, if
$$\lim_{n \to \infty} U^n\pi(y) = a$$
exists for a single $y$, then
$$\lim_{n \to \infty} U^n\pi(x) = a$$
for every $x$. Since a subsequence can be found so that
$$\lim_{i \to \infty} U^{n_i}\pi(x) = a$$
for one $x$ and hence for all $x$, an argument used in the close of the proof of Theorem 25 shows that
$$\lim_{n \to \infty} U^n\pi(x) = a.$$
The lemma easily implies that the convergence is uniform. Using the fact that $\|U^n\| = 1$, we can sum up the conclusions for this nonlinear model as follows:

THEOREM 35. If $\pi(t)$ is continuous, then $\lim_{n \to \infty} U^n\pi$ exists uniformly, converging to a constant limit.
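The convergence of $U^n\pi$ to a constant can be observed on a small example. The following is a minimal sketch, assuming the operator of equation (21), $(U\pi)(t) = (1 - \phi(t))\pi(\sigma t) + \phi(t)\pi(1 - \alpha + \alpha t)$; the function name `iterate_U_general` and the particular choice $\phi(t) = 0.3 + 0.4t$, $\sigma = \alpha = 0.5$ are illustrative assumptions, not taken from the text. The recursion visits $2^n$ points, so $n$ should stay small.

```python
def iterate_U_general(pi, phi, sigma, alpha, n):
    """Return U^n pi for (U pi)(t) = (1 - phi(t))*pi(sigma*t) + phi(t)*pi(1 - alpha + alpha*t)."""
    def u_n(t, k=n):
        if k == 0:
            return pi(t)
        p = phi(t)  # probability of the impulse toward 1
        return ((1.0 - p) * u_n(sigma * t, k - 1)
                + p * u_n(1.0 - alpha + alpha * t, k - 1))
    return u_n

# Illustrative parameters: phi stays between 0.3 and 0.7, so delta = 0.3 works.
phi = lambda t: 0.3 + 0.4 * t
f12 = iterate_U_general(lambda t: t, phi, 0.5, 0.5, 12)
spread = abs(f12(0.9) - f12(0.1))
```

For this choice one can verify by hand that $U$ maps $t$ to $0.15 + 0.7t$, so $U^n t = 0.5 + 0.7^n (t - 0.5)$: the dependence on the starting point dies out geometrically and the limit is the constant 0.5.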


THEOREM 36. If $\phi(t)$ belongs to $C^m$, and $\pi(t)$ is in $C^m$, then
$$\lim_{n \to \infty} (U^n\pi)^{(m)}(t) = 0$$
with convergence uniform in $t$.


THEOREM 37. For any distribution $F$, $T^nF$ converges to a distribution $F_{\sigma,\alpha}$ independent of $F$ with $TF_{\sigma,\alpha} = F_{\sigma,\alpha}$ and $F_{\sigma,\alpha}$ continuous with respect to $\sigma, \alpha$.

This last theorem follows on account of the conjugate relationship of $T$ and $U$. Finally, we note that the method used in this section can be employed to analyze the random walks with any number of impulses
$$F_i x = (1 - \alpha_i)\lambda_i + \alpha_i x.$$


7. In the present section we investigate the nature of the limiting distribution obtained in the various models. In the case where the boundaries are absorbing states as in §§ 1 and 5, we find that the limiting distribution is discrete and concentrates at the two ends 0 and 1. The weight at 1 depends on the starting distribution $F$ and is given by
$$\int_0^1 \phi_{\sigma,\alpha}(t)\,dF(t),$$
where $\phi_{\sigma,\alpha}$ is the unique continuous fixed point of $U\phi = \phi$ with $\phi(0) = 0$ and $\phi(1) = 1$. Many properties of $\phi_{\sigma,\alpha}$ are developed in those sections. In all the other types the ergodic property was seen to hold and the limiting distribution was independent of the initial distribution. Let us deal with the following general type. The random walk is given by $x \to F_1 x = \sigma x$ with probability $1 - \phi(x)$, and $x \to F_2 x = 1 - \alpha + \alpha x$ with probability $\phi(x)$, where $1 - \delta \ge \phi(x) \ge \delta > 0$. The relevant operators are given by equations (20) and (21). Let the limiting distribution be denoted by $F_{\sigma,\alpha}$.

We now distinguish two cases: (a) $\sigma \ge 1 - \alpha$ and (b) $\sigma < 1 - \alpha$. Let us examine case (b) first. We note that the union of the image sets $F_1[0, 1] + F_2[0, 1]$ of $F_1$ and $F_2$ applied to the unit interval does not overlap with the open subinterval $(\sigma, 1 - \alpha)$. Any two applications of $F_1$ and $F_2$ leave empty the two additional open intervals $(\sigma^2, \sigma(1 - \alpha))$ and $(1 - \alpha + \alpha\sigma, 1 - \alpha^2)$. Proceeding in this way, we find that the limit of the total set covered by $n$ applications of $F_i$ $(i = 1, 2)$ in any arrangement is a Cantor set $C$. It is easily seen that $F_{\sigma,\alpha}$ must concentrate its full probability on this set $C$.
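The Cantor-type construction can be traced step by step. The following is a minimal sketch, assuming the impulses $F_1 x = \sigma x$ and $F_2 x = 1 - \alpha + \alpha x$ of this section; the function names `covered_intervals` and `gaps` are hypothetical. It lists the images of $[0, 1]$ under all $n$-fold compositions and the open gaps left between them.

```python
def covered_intervals(sigma, alpha, n):
    """Images of [0, 1] under all n-fold compositions of F1 x = sigma*x and F2 x = 1 - alpha + alpha*x."""
    intervals = [(0.0, 1.0)]
    for _ in range(n):
        nxt = []
        for lo, hi in intervals:
            nxt.append((sigma * lo, sigma * hi))                              # image under F1
            nxt.append((1.0 - alpha + alpha * lo, 1.0 - alpha + alpha * hi))  # image under F2
        intervals = nxt
    return sorted(intervals)

def gaps(intervals):
    """Open gaps between consecutive sorted intervals (none when they touch or overlap)."""
    result = []
    reach = intervals[0][1]
    for lo, hi in intervals[1:]:
        if lo > reach:
            result.append((reach, lo))
        reach = max(reach, hi)
    return result
```

With `sigma = alpha = 0.25` (case (b)), one step leaves the gap `(0.25, 0.75)`, and two steps add the gaps `(0.0625, 0.1875)` and `(0.8125, 0.9375)`. With `sigma = alpha = 0.75` (case (a)) the images overlap and no gap appears.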

Now let
$$\pi_{t_0}(t) = \begin{cases} 1, & \text{if } t = t_0, \\ 0, & \text{if } t \ne t_0. \end{cases}$$
We show that $U^n\pi_{t_0}(t)$ converges uniformly to zero. Note that $U\pi_{t_0}(t)$ is zero for every $t$ except at most one value of $t$; namely, $F_1^{-1}t_0$ or $F_2^{-1}t_0$. Of course, if $\sigma < t_0 < 1 - \alpha$, then neither inverse exists for that $t_0$; and otherwise only one exists and
$$|U\pi_{t_0}| \le \max_t\,[\phi(t), 1 - \phi(t)] \le 1 - \delta.$$
Similarly, $|U^n\pi_{t_0}| \le (1 - \delta)^n$, from which the assertion follows. We now observe that
$$(\pi_{t_0}, F_{\sigma,\alpha}) = (\pi_{t_0}, T^nF_{\sigma,\alpha}) = (U^n\pi_{t_0}, F_{\sigma,\alpha}) \to 0.$$
Consequently, the probability of $F_{\sigma,\alpha}$ at $t_0$ is zero for any $t_0$ with $0 \le t_0 \le 1$. Summing up, we have established:


THEOREM 38. If $\sigma < 1 - \alpha$, then the limiting distribution $F_{\sigma,\alpha}$ is a singular distribution (probability zero at every point) spread on a Cantor-like set.


We now turn to examine case (a) where $\sigma \ge 1 - \alpha$. We note first that at least one of the two mappings $F_1^{-1}$ or $F_2^{-1}$ is defined for every $x$ in the unit interval. Let $\pi(t)$ denote any continuous positive function defined on the unit interval so that $\pi(t) \ge \eta > 0$ for some subinterval $t_0 - h \le t \le t_0 + h$ $(h > 0)$. Since at least $F_1^{-1}$ or $F_2^{-1}$ exists at $t_0$ (say $F_{i_1}^{-1}$), we obtain $F_{i_1}^{-1}t_0 = t_1$. We construct $t_2$ from $t_1$ in the same way and continue this for $n$ steps, obtaining $t_n = F^{-n}t_0$, where $F^{-n}$ denotes a specific order of application of $F_1^{-1}$ or $F_2^{-1}$ a total of $n$ times. Let $F^n$ denote the reverse order of the operators obtained by passing from $t_0$ to $t_n$. We note that
$$|F^n x - F^n y| \le \lambda^n |x - y| \le \lambda^n,$$
where $\lambda < 1$. Choose $n$ so large that $\lambda^n < h$; then for every $x$ we get that
$$|F^n x - F^n t_n| = |F^n x - t_0| < h.$$
Consequently, as
$$1 - \delta \ge \phi(x) \ge \delta > 0,$$
$U^n\pi$ is positive for all $t$, since $F^{-n}[t_0 - h, t_0 + h]$ covers the entire unit interval and $\pi(t) \ge \eta > 0$ on this initial interval, which is spread out by the term in $U^n$ involving $F^n$. We have thus shown:


THEOREM 39. If $\sigma \ge 1 - \alpha$, the operator $U$ is strictly positive; that is, for each positive continuous function $\pi(t)$ there exists an $n$ depending upon $\pi$ so that $U^n\pi$ is strictly positive.


Now let $\pi_{t_0}(t)$ be defined as before. Again we establish that $U^n\pi_{t_0}$ converges uniformly to zero. To this end we observe that $U\pi_{t_0}$ has at most two possible values at $F_1^{-1}t_0$ and $F_2^{-1}t_0$ given by $1 - \phi(F_1^{-1}t_0)$ and $\phi(F_2^{-1}t_0)$, respectively, while $U\pi_{t_0} = 0$ elsewhere. Also, $U^2\pi_{t_0}$ has at most four possible values and the maximum value that could be achieved for $U^2\pi_{t_0}$ is
$$\max\bigl\{[1 - \phi(F_1^{-1}t_0)][1 - \phi(F_1^{-1}F_1^{-1}t_0)],\ \phi(F_2^{-1}t_0)\,\phi(F_2^{-1}F_2^{-1}t_0),\ [1 - \phi(F_1^{-1}t_0)]\,\phi(F_2^{-1}F_1^{-1}t_0) + \phi(F_2^{-1}t_0)[1 - \phi(F_1^{-1}F_2^{-1}t_0)]\bigr\}.$$
To secure a bound for the maximum of $U^n\pi_{t_0}$, let us consider the same repeated-experiment model set up in the previous section. The conditional probabilities of success $p_n$ at the $n$th trial satisfy the uniform inequalities $1 > 1 - \eta \ge p_n \ge \eta > 0$, where success in this case is taken to be an application of the impulse $F_1$ to the particle.

It is readily seen by standard inequalities that the probability of securing $k$ $(k \le n)$ successes converges uniformly to zero as $n \to \infty$. Moreover, it follows directly that $\max_k$ (probability of $k$ successes) is a bound for $U^n\pi_{t_0}$, and hence $U^n\pi_{t_0} \to 0$. We deduce as before that $F_{\sigma,\alpha}$ has probability zero for every $t$. Hence the distribution $F_{\sigma,\alpha}$ is continuous. Let $F_{\sigma,\alpha} = F_1 + F_2$, where $F_1$ is absolutely continuous and $F_2$ is singular. Observing that the transition operator transforms absolutely continuous measures into absolutely continuous measures and singular measures into singular measures, we find that $TF_1 = F_1$ and $TF_2 = F_2$. However, as the fixed distribution is unique, we deduce that either $F_1$ or $F_2$ vanishes.

THEOREM 40. If $\sigma \ge 1 - \alpha$, then the unique distribution $F_{\sigma,\alpha}$ is either absolutely continuous or singular. Furthermore, $F_{\sigma,\alpha}$ has positive measure in every open interval.

PROOF. We have demonstrated all the conclusions of the theorem but the last. Let $\pi(t)$ denote a continuous function bounded by 1, and zero outside an open interval $I$, and 1 on a closed subinterval $I'$ of $I$. By virtue of Theorem 39 there exists an $n$ such that $U^n\pi \ge \delta > 0$ for all $t$. We note that
$$\int_I dF_{\sigma,\alpha} \ge (\pi, F_{\sigma,\alpha}) = (\pi, T^nF_{\sigma,\alpha}) = (U^n\pi, F_{\sigma,\alpha}) \ge \delta > 0,$$
and the proof of the theorem is complete.
We close with the conjecture that when $\sigma > 1 - \alpha$, then $F_{\sigma,\alpha}$ is always absolutely continuous. An example where this is the case is furnished by $\phi(t) \equiv 1/2$, [remainder of the example illegible in the source].


SAMUEL KARLIN 403 


REFERENCES 


[1] R. Bellman, T. Harris, and H. N. Shapiro. Studies on functional equations occurring in 
decision processes. RM 878, RAND Corporation, July, 1952. 

[2] R. R. Bush and C. F. Mosteller. A mathematical model for simple learning. Psych. Rev., 
1951, 58, 313-323. 

[3] J. L. Doob. Asymptotic properties of Markoff transition probabilities, Trans. Amer. Math. 
Soc., 1948, 63, 393-421. 

[4] W. Feller. An introduction to probability theory and its applications. New York: Wiley, 
1950. 

[5] M. M. Flood. On game learning theory. RM 853, RAND Corporation, May 30, 1952. 

[6] O. Onicescu and G. Mihoc. Sur les chaines de variables statistiques. Bull. Sci. Math., 
1935, 2, 59, 174-192. 

[7] W. Doeblin and R. Fortet. Sur les chaînes à liaisons complètes. Bull. Soc. Math. France,
1937, 65, 132-148.

[8] R. Fortet. (Thèse) Sur l'itération des substitutions algébriques linéaires à une infinité de
variables. Revista, No. 424, Año 40, Lima, 1938.

[9] Ionescu Tulcea and G. Marinescu. Sur certaines chaînes à liaisons complètes. C.R. Acad.
Sci. Paris, 1948, 227, 667-669.


Received December 19, 1952. 


SOME ASYMPTOTIC PROPERTIES OF LUCE'S
BETA LEARNING MODEL*


JOHN LAMPERTI AND PATRICK SUPPES
STANFORD UNIVERSITY


This paper studies asymptotic properties of Luce's beta model. Asymptotic results are given for the two-operator and four-operator cases of contingent and noncontingent reinforcement.


For application to various simple learning situations, Luce and his collaborators, Bush and Galanter, [1, 7] have considered a learning model in which the changes in probability of response from trial to trial are not linear functions of the probability of response on the preceding trial. Both theoretical and empirical considerations have motivated the development of the beta model. Some learning theorists like Hull and Spence believe that overt response behavior may best be explained in terms of a construct like that of response strength. From this viewpoint stochastic learning models which postulate a linear transformation of the probability of response from one trial to the next, with the transformation depending on the reinforcing event, are unsatisfactory in so far as they offer no more general psychological justification of their postulates. From an empirical standpoint there is evidence in some experiments, particularly certain T-maze experiments with rats, that the linear stochastic models do not yield good predictions of actual behavior [1, 7].


On the basis of some very simple postulates [7] on choice behavior, Luce has shown that there exists a ratio scale $v$ over the set of responses with the property that
$$p_{i,n} = \frac{v_n(i)}{\sum_j v_n(j)},$$
where $p_{i,n}$ is the probability of response $A_i$ on trial $n$, and $v_n(i)$ is the strength of this response on trial $n$. Additional simple postulates lead to the result that the $v_n(i)$ are transformed linearly from trial to trial, and this unobservable stochastic process on response strengths then determines a stochastic process
*This research was supported in part by the C 


} 23 
of Naval Research and in part by the Rockefeller Round nt loE PTs TENA 
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in the response probabilities. Superficially, it would seem that the simplest 
way to study the asymptotic behavior of the response probabilities—a 
subject of interest in connection with nearly any learning data—would be 
to determine the asymptotic behavior of the response strengths v,(7) and 
then infer by means of the equation given above the behavior of the response 
probabilities. This course is pursued rather far by Luce [7] and encounters 
numerous mathematical difficulties. We have taken the alternative path of 
studying directly the properties of the nonlinear transformations on the 
response probabilities to obtain results on their asymptotic behavior. 

We restrict ourselves to situations in which one of two responses, $A_1$ 
and $A_2$, is made. Let $p_n$ be the probability of response $A_1$ on trial $n$, and let 
$E_1$ be the event of reinforcing response $A_1$, and $E_2$ the event of reinforcing 
response $A_2$. 

Luce's beta model is then characterized by the following transformations: 
if $A_j$ and $E_k$ occurred on trial $n$, then for $j = 1, 2$ and $k = 1, 2$, 

(1)  $p_{n+1} = \dfrac{p_n}{p_n + \beta_{jk}(1 - p_n)},$ 

where $\beta_{jk} > 0$. Luce [7] gives a more general formulation. (Generally, we 
want $\beta_{j1} < 1$ and $\beta_{j2} > 1$, to reflect the primary effects of reinforcement; 
moreover, it is ordinarily assumed that $\beta_{11} < \beta_{21} < \beta_{12} < \beta_{22}$.) Throughout 
this paper it is assumed that $0 \ne p_1 \ne 1$. 

The most important fact about (1) is that the operators commute. For 
example, suppose in the first $n$ trials there are $b_1$ occurrences of $A_1E_1$, $b_2$ 
occurrences of $A_2E_1$, $b_3$ occurrences of $A_1E_2$, $b_4$ occurrences of $A_2E_2$; then 
it is easily shown that 

(2)  $p_{n+1} = \dfrac{p_1}{p_1 + \beta_{11}^{b_1}\beta_{21}^{b_2}\beta_{12}^{b_3}\beta_{22}^{b_4}(1 - p_1)}.$ 
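
The commutation property can be checked numerically. The following is a minimal sketch, not part of the original paper: the parameter values, the particular sequence of operators, and the function names are all illustrative assumptions. It applies transformation (1) in every possible order and compares the result with the closed form (2).

```python
from itertools import permutations


def beta_update(p, beta):
    """One step of transformation (1): p -> p / (p + beta * (1 - p))."""
    return p / (p + beta * (1.0 - p))


def closed_form(p1, betas, counts):
    """Equation (2): apply operator betas[i] exactly counts[i] times at once."""
    factor = 1.0
    for b, k in zip(betas, counts):
        factor *= b ** k
    return p1 / (p1 + factor * (1.0 - p1))


# Hypothetical values for (beta_11, beta_21, beta_12, beta_22).
betas = (0.5, 0.8, 1.25, 2.0)
# One illustrative history: beta_11 twice, every other operator once.
sequence = [0.5, 2.0, 0.8, 0.5, 1.25]
p1 = 0.3

results = []
for order in permutations(sequence):
    p = p1
    for b in order:
        p = beta_update(p, b)
    results.append(p)

# Largest disagreement between any two orders of application.
spread = max(results) - min(results)
direct = closed_form(p1, betas, (2, 1, 1, 1))
```

Up to floating-point rounding, `spread` is zero and every entry of `results` equals `direct`, because each operator simply multiplies the odds $(1 - p)/p$ by its $\beta_{jk}$.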

The aim of the present paper is to study asymptotic properties of the 
beta model for certain standard probabilistic schedules of reinforcement. 
The methods of attack used by Karlin [4], and by Lamperti and Suppes [6] 
for linear learning models do not directly apply to the nonlinear beta model. 

The basis of our approach is to change the state space (the probability 
$p_n$ is the state) from the unit interval to the whole real line in such a way 
that the transformations (1) become simply translations. The noncontingent 
case (the next section) then reduces to sums of independent random variables; 
the contingent cases can also be studied by "comparing" the resulting random 
walks with the case of sums of random variables. The probabilistic tool for 
this is developed and applied in later sections. The general conclusion to be 
drawn from our results is that for all but one case of noncontingent 
reinforcement individual response probabilities are ultimately either zero or one, 
which is in marked contrast to corresponding results for linear learning 
models. Absorption at zero or one also occurs for many, but not all, cases of 
contingent reinforcement. 


Noncontingent Reinforcement with Two Operators 


If the probability of a reinforcement is independent of response and 
trial number, we have what is called simple noncontingent reinforcement. 
Let $\pi$ be the probability of an $E_1$ reinforcement, and for simplicity let 

(3)  $\beta_{11} = \beta_{21} = \beta, \quad \beta_{12} = \beta_{22} = \gamma, \quad 0 < \beta < 1, \quad \gamma > 1.$ 

We seek an expression for the asymptotic probability distribution of response 
probabilities in terms of the numbers $\pi$, $\beta$, and $\gamma$. 


The random variable $\eta_n$ is defined recursively as follows: 

$\eta_1 = \begin{cases} \beta & \text{with prob } \pi, \\ \gamma & \text{with prob } (1 - \pi), \end{cases} \qquad \eta_{n+1} = \begin{cases} \beta\eta_n & \text{with prob } \pi, \\ \gamma\eta_n & \text{with prob } (1 - \pi). \end{cases}$ 

The random variable $X_n$ is defined as follows: 

$X_n = \log \eta_n .$ 

Then 

(4)  $X_{n+1} = \begin{cases} X_n + \log\beta & \text{with prob } \pi, \\ X_n + \log\gamma & \text{with prob } (1 - \pi). \end{cases}$ 

It is clear from (4) and what has preceded that $X_n$ is the sum of $n$ independent 
identically distributed random variables $Y_i$ defined by 

$Y_i = \begin{cases} \log\beta & \text{with prob } \pi, \\ \log\gamma & \text{with prob } (1 - \pi). \end{cases}$ 

By the strong law of large numbers, with probability one as $n \to \infty$, 

(5)  $X_n \to +\infty$ if $\pi\log\beta + (1 - \pi)\log\gamma > 0$, 
     $X_n \to -\infty$ if $\pi\log\beta + (1 - \pi)\log\gamma < 0$. 

Define now for any real number $x$ 

(6)  $F_x(p) = \dfrac{p}{p + e^{x}(1 - p)} .$ 

Then $p_{n+1} = F_{X_n}(p_1)$ for the sequence of reinforcements $\eta_n$, where $X_n = \log \eta_n$. 
These results are utilized to prove the following theorem. 


THEOREM 1. Let $c = \pi\log\beta + (1 - \pi)\log\gamma$. Then with probability one 

$\lim_{n\to\infty} p_n = \begin{cases} 0 & \text{if } c > 0, \\ 1 & \text{if } c < 0. \end{cases}$ 

If $c = 0$, then $p_n$ oscillates between 0 and 1, so that with probability one 

$\limsup p_n = 1, \qquad \liminf p_n = 0.$ 

Despite this oscillation, there is a limiting distribution for $p_n$; it is concentrated 
at 0 and 1 with equal probabilities $\tfrac{1}{2}$. 


PROOF. The results for $c > 0$ and $c < 0$ follow immediately from (5), 
(6), and the remark following. In case $c = 0$, note that $E(Y_i) = 0$. It is 
known [2] that the sums $X_n$ are then recurrent; that is, they repeatedly 
take on values arbitrarily close to any possible value. In particular, $X_n$ takes 
on repeatedly arbitrarily large and arbitrarily small values (with probability 
one), which upon recalling (6) proves the second statement. The third statement 
is a consequence of the central limit theorem, which implies that for 
any $A$, $\Pr(X_n > A)$ and $\Pr(X_n < -A)$ both converge to one-half as $n$ 
increases. Again the assertion of the theorem follows from this fact and (6). 
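
The content of Theorem 1 can be illustrated by simulation. The sketch below is an illustration only, under assumed parameter values; the function names `drift`, `F`, and `simulate` are hypothetical and not taken from the paper. It computes the constant $c$ and follows one sample path of the translated process $X_n$, mapping it back to a response probability through (6). The logistic form is evaluated in a numerically stable way so that large $|X_n|$ does not overflow.

```python
import math
import random


def drift(pi, beta, gamma):
    """c = pi*log(beta) + (1 - pi)*log(gamma), the mean step of X_n."""
    return pi * math.log(beta) + (1.0 - pi) * math.log(gamma)


def F(x, p):
    """Equation (6), F_x(p) = p / (p + exp(x)*(1 - p)), computed stably."""
    z = x + math.log((1.0 - p) / p)   # F_x(p) = 1 / (1 + exp(z))
    if z > 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def simulate(pi, beta, gamma, p1, n_trials, seed=0):
    """Response probability after n_trials noncontingent reinforcements."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(n_trials):
        # E_1 occurs with probability pi and translates X by log(beta).
        x += math.log(beta) if rng.random() < pi else math.log(gamma)
    return F(x, p1)


# Hypothetical parameters: beta = 1/2, gamma = 2, starting probability 0.3.
c_neg = drift(0.7, 0.5, 2.0)      # negative drift, so p_n should approach 1
c_pos = drift(0.3, 0.5, 2.0)      # positive drift, so p_n should approach 0
p_high = simulate(0.7, 0.5, 2.0, 0.3, 2000)
p_low = simulate(0.3, 0.5, 2.0, 0.3, 2000)
```

With these values $c$ is about $\mp 0.28$ per trial, so after 2000 trials the simulated probability is indistinguishable from 1 or 0 respectively. The boundary case $\pi = 1/2$ gives $c = 0$, where the theorem predicts oscillation rather than convergence.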


Two Theorems on Random Walks 


The results of this section are special cases of those in [5]. However, 
the present approach has the advantages of simplicity and directness. 

We have seen that the two-operator, noncontingent beta model gives 
rise to a Markov process on the real line such that from $x$ the "moving 
particle" goes to $x + a$ or $x - b$ with (constant) probabilities $p$ and $1 - p$. 
The contingent case leads to a similar process, except that the transition 
probabilities become functions of $x$. The four-operator model gives rise to a 
process with four possible transitions, from $x$ to $x + a_i$, say, $i = 1, 2, 3, 4$. 
In this section some simple results on processes of these sorts will be obtained, 
in preparation for the study of the more general cases of the beta model. In 
the interest of clarity, only the two-operator case will be treated in full; the 
more general case can be handled in a similar way, but the details are 
cumbersome. Our approach was suggested by the work of Hodges and Rosenblatt [3]. 

Let $\{X_n\}$ be a real Markov process such that if $X_n = x$, 

(9)  $X_{n+1} = \begin{cases} x + a & \text{with prob } \varphi(x), \\ x - b & \text{with prob } [1 - \varphi(x)], \end{cases}$ 

where $0 < a,\ b,\ \varphi(x),\ 1 - \varphi(x)$. Let $\{Y_n\}$ be another process of the same type 
(and with the same $a$ and $b$) but with constants $\theta$ and $1 - \theta$ as the transition 
probabilities in place of $\varphi(x)$ and $1 - \varphi(x)$. 


LEMMA. If for all $x > M$, one has $\varphi(x) \ge \theta$, and if $\Pr(Y_n \to +\infty) > 0$, 
then $\Pr(X_n \to +\infty) > 0$. If, on the other hand, for $x > M$, $\varphi(x) \le \theta$ and if 
$\Pr(Y_n \to +\infty) = 0$, then $\Pr(X_n \to +\infty) = 0$. 

PROOF. Let $\{\xi_n\}$ be a sequence of independent random variables, each 
uniformly distributed on $[0, 1]$. The $\{X_n\}$ process will be referred to $\{\xi_n\}$ 
by letting 

(10)  $X_{n+1} = \begin{cases} X_n + a & \text{if } \xi_{n+1} \le \varphi(X_n), \\ X_n - b & \text{otherwise.} \end{cases}$ 

This does lead to the transition law (9) as may easily be seen. The $\{Y_n\}$ 
process can be linked to $\{X_n\}$ by referring it after the manner of (10) to the 
same sequence $\{\xi_n\}$, so that $Y_{n+1} = Y_n + a$ if and only if $\xi_{n+1} \le \theta$. 

Choose $Y_0 > M$. Whatever the value of $X_0$, since $\varphi(x) > 0$ there is 
positive probability that $X_m \ge Y_0$ for some $m$; therefore assume $X_0 \ge Y_0$. 
We now assert that for those sequences $\{Y_n\}$ with the property that $Y_n > M$ 
for all $n$, the inequality $X_n \ge Y_n$ is also valid for all $n$. This follows from our 
construction "linking" the processes, and the assumption that $\varphi(x) \ge \theta$ 
for $x > M$; the transition $X_{n+1} = X_n - b$ and $Y_{n+1} = Y_n + a$ is impossible, 
so $X_n - Y_n$ can only increase. 

To complete the proof, note that since $\Pr(Y_n \to +\infty)$ is positive, so is 
$\Pr(Y_n \to +\infty,\ Y_n > M \text{ for all } n)$. But the event $Y_n \to +\infty$, $Y_n > M$ for all 
$n$ may be considered as a set $S$ in the sample space of the sequence $\{\xi_n\}$; 
$S$ is a set of positive probability, and is contained in the set $X_n \to \infty$ since 
on $S$, $X_n \ge Y_n$ and $Y_n \to \infty$. Hence $\Pr(X_n \to +\infty) > 0$. The second part 
of the lemma is proved in a similar way, using the same construction linking 
$\{X_n\}$ and $\{Y_n\}$. 
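
The linking construction in the proof can be sketched in code. This is an illustrative sketch only: the particular function `phi`, the constants, and the name `coupled_walks` are assumptions made for the example, and for simplicity the inequality $\varphi(x) \ge \theta$ is taken to hold for every $x$ rather than only for $x > M$. Both walks are driven by the same uniform variates, so the walk with the larger upward probability can never fall below the other.

```python
import math
import random


def coupled_walks(phi, theta, a, b, x0, y0, n_steps, seed=0):
    """Drive X (state-dependent phi) and Y (constant theta) by the same xi."""
    rng = random.Random(seed)
    x, y = x0, y0
    xs, ys = [x], [y]
    for _ in range(n_steps):
        xi = rng.random()                    # shared uniform variate
        x = x + a if xi <= phi(x) else x - b
        y = y + a if xi <= theta else y - b
        xs.append(x)
        ys.append(y)
    return xs, ys


def phi(x):
    """Hypothetical transition function with 0.6 <= phi(x) <= 0.9."""
    return 0.6 + 0.3 * (0.5 + 0.5 * math.tanh(x))


# theta = 0.6 is a lower bound for phi, and both walks start at 0.
xs, ys = coupled_walks(phi, 0.6, 1.0, 1.0, 0.0, 0.0, 5000)
dominated = all(x >= y for x, y in zip(xs, ys))
```

Here `dominated` is true for every seed: a step on which $X$ moves down while $Y$ moves up would require $\xi > \varphi(X_n)$ and $\xi \le \theta$ at once, which is impossible when $\varphi \ge \theta$.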
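THEOREM 2. Let $b/(a + b) = c$, and suppose that 

(11)  $\lim_{x\to+\infty} \varphi(x) = \alpha$ and $\lim_{x\to-\infty} \varphi(x) = \beta$ 

exist. Then if $\alpha < c$ and $\beta > c$, 

(12)  $\Pr(\limsup X_n = +\infty,\ \liminf X_n = -\infty) = 1$  ($\{X_n\}$ is recurrent), 

while if $\alpha < (>)\, c$ and $\beta < (>)\, c$, then 

(13)  $\Pr(X_n \to -\infty\ (+\infty)) = 1.$ 

Finally, if $\alpha > c$ and $\beta < c$, 

(14)  $\Pr(X_n \to +\infty) = \delta, \qquad \Pr(X_n \to -\infty) = 1 - \delta$ 

for some $0 < \delta < 1$. 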
PROOF. Suppose, for instance, that $\alpha < c$. Let $\{Y_n\}$ (as in the lemma) 
be a process with constant transition probabilities $\theta$ and $1 - \theta$ where 
$\alpha < \theta < c$. The $\{Y_n\}$ process may be regarded as sums of random variables 

(15)  $Y_n = Y_0 + \sum_{i=1}^{n} Z_i$, where $\Pr(Z_i = a) = \theta$ and $\Pr(Z_i = -b) = 1 - \theta$. 

But $E(Z_i) = a\theta - b(1 - \theta) < 0$, since $\theta < c$; this implies that 
$\Pr(Y_n \to -\infty) = 1$ by the law of large numbers. From the lemma, 

(16)  $\Pr(X_n \to +\infty) = 0.$ 

Similarly, if $\alpha > c$ it follows that $\Pr(X_n \to +\infty) > 0$. Since the lemma 
also holds for convergence to $-\infty$ (with $\varphi$ and $\theta$ replaced by $1 - \varphi$ and 
$1 - \theta$), we obtain in the same way that $\beta < c$ makes $\Pr(X_n \to -\infty) > 0$, 
while if $\beta > c$ this probability is zero. 

Consider the case when $\alpha < c$ and $\beta < c$; there is then positive probability 
of absorption at $-\infty$, but not at $+\infty$. It is not hard to see that $X_n \to -\infty$ 
with probability one; the idea is roughly as follows. Since $X_n \not\to +\infty$, we 
have $X_n < N$ infinitely often with probability arbitrarily close to 1 for some 
$N$. Now the probability that from or to the left of $N$ the random walk goes and 
remains to the left of $N - M$ must be positive since $\Pr(X_n \to -\infty) > 0$. 
But in an infinite sequence of not necessarily independent trials, an event 
whose probability on each trial is bounded away from zero is certain to 
occur. Hence for any $M$, the random walk will eventually become and remain 
to the left of $N - M$, and therefore $X_n \to -\infty$ with probability arbitrarily 
close to 1 (and so equal to one). The other cases are similar; one can think 
of $\alpha > c$ or $\alpha < c$ as the conditions under which $+\infty$ is an absorbing or 
reflecting barrier, etc., and the process behaves accordingly. 
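
The case analysis of Theorem 2 amounts to comparing the two limiting upward probabilities with the critical value $c = b/(a + b)$, at which the mean step $a\theta - b(1 - \theta)$ changes sign. A small sketch of that bookkeeping follows; the function name and the returned labels are illustrative choices, not notation from the paper.

```python
def classify_walk(alpha, beta, a, b):
    """Classify the walk (9) by Theorem 2.

    alpha, beta: limits of phi(x) as x -> +infinity and x -> -infinity.
    a, b: sizes of the upward and downward steps (both positive).
    """
    c = b / (a + b)                      # zero-drift probability
    if alpha == c or beta == c:
        return "boundary case, not covered by Theorem 2"
    if alpha < c and beta > c:
        return "recurrent"               # (12): both ends reflect
    if alpha < c and beta < c:
        return "X_n -> -infinity"        # (13)
    if alpha > c and beta > c:
        return "X_n -> +infinity"        # (13)
    return "X_n -> +infinity or -infinity"   # (14): both ends absorb


# With equal step sizes the critical value is c = 1/2.
cases = [
    classify_walk(0.3, 0.7, 1.0, 1.0),
    classify_walk(0.3, 0.4, 1.0, 1.0),
    classify_walk(0.6, 0.7, 1.0, 1.0),
    classify_walk(0.7, 0.3, 1.0, 1.0),
]
```

The four entries of `cases` correspond, in order, to (12), the two versions of (13), and (14).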

The generalization to the four-operator case will now be described. Let 
$\{X_n\}$ be a real Markov process such that if $X_n = x$, then 

(17)  $X_{n+1} = x + a_i$ with prob $\varphi_i(x)$, 

where $a_1, a_2 > 0 > a_3, a_4$ and $\varphi_i(x) > 0$. Suppose 

(18)  $\lim_{x\to+\infty} \varphi_i(x) = \alpha_i$ and $\lim_{x\to-\infty} \varphi_i(x) = \beta_i$ 

exist, and let 

$\mu_+ = \sum_{i=1}^{4} a_i\alpha_i$ and $\mu_- = \sum_{i=1}^{4} a_i\beta_i .$ 

By methods entirely similar to those used above, but rather more involved, 
it is possible to prove the following. 

THEOREM 3. For the process $\{X_n\}$ described above, if $\mu_+ < 0$ and $\mu_- > 0$ 
then (12) holds; if $\mu_+ < (>)\, 0$ and $\mu_- < (>)\, 0$ then (13) applies; while if $\mu_+ > 0$ 
and $\mu_- < 0$, (14) is valid. 
Contingent Reinforcement with Two Operators 

If the probability of reinforcement depends only on the immediately 
preceding response (on the same trial), one has (simple) contingent 
reinforcement. Let $\Pr(E_1 \mid A_1) = \pi_1$ and $\Pr(E_1 \mid A_2) = \pi_2$, and let the two operators 
$\beta$ and $\gamma$ be specified as in (3). Using (6), define the random variable $X_n$ 
recursively. (Note that $\log\gamma$ appears first, since $\log\gamma > 0$ and $\log\beta < 0$, 
in order most directly to apply Theorem 2.) 

(19)  $X_{n+1} = \begin{cases} X_n + \log\gamma & \text{with prob } F_{X_n}(p_1)(1 - \pi_1) + (1 - F_{X_n}(p_1))(1 - \pi_2) = \varphi(X_n), \\ X_n + \log\beta & \text{with prob } [1 - \varphi(X_n)]. \end{cases}$ 

Observe that 

(20)  $\lim_{x\to+\infty} \varphi(x) = 1 - \pi_2$ and $\lim_{x\to-\infty} \varphi(x) = 1 - \pi_1 .$ 

Combining (20) and Theorem 2, one then has immediately Theorem 4. 


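THEOREM 4. For the contingent case of the two-operator model, let 
$c = -\log\beta / \log(\gamma/\beta)$. Then with probability one 

(i) if $1 - \pi_2 < c$ and $1 - \pi_1 > c$ then $\limsup p_n = 1$ and $\liminf p_n = 0$, 
(ii) if $1 - \pi_2 < c$ and $1 - \pi_1 < c$ then $p_\infty = 1$, 
(iii) if $1 - \pi_2 > c$ and $1 - \pi_1 > c$ then $p_\infty = 0$; 

moreover, 

(iv) if $1 - \pi_2 > c$ and $1 - \pi_1 < c$ then for some $\delta$ with $0 < \delta < 1$ 

$\Pr(p_\infty = 1) = \delta, \qquad \Pr(p_\infty = 0) = 1 - \delta.$ 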
The intuitive character of the distinction between the results expressed 
in (i) and (iv) of this theorem should be clear. If $1 - \pi_2 < c$ and $1 - \pi_1 > c$, 
then probability zero of an $A_1$ response and probability one of an $A_1$ response 
are both reflecting barriers, whereas if $1 - \pi_2 > c$ and $1 - \pi_1 < c$ they are 
both absorbing barriers. 

It is also to be noticed that except when $1 - \pi_1 = c$ or $1 - \pi_2 = c$, 
Theorem 4 covers all values of $\beta$, $\gamma$, $\pi_1$, and $\pi_2$ for the contingent case. It can 
be shown [5] by deeper methods that if $1 - \pi_1 = c$ (or $1 - \pi_2 = c$), then 
probability one (respectively zero) of an $A_1$ response is again a reflecting 
barrier. These results agree with those given by Luce ([7], p. 124) and in 
addition settle most of the open questions in his Table 6. Detailed comparison 
is tedious because his classification of cases differs considerably from ours as 
given in the above theorem. 
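Contingent Reinforcement with Four Operators 


We want finally to apply Theorem 3 to the contingent case of the general 
four-operator model formulated in (1). Analogous to (19), 

(21)  $X_{n+1} = \begin{cases} X_n + \log\beta_{22} & \text{with prob } (1 - \pi_2)(1 - F_{X_n}(p_1)) = \varphi_{22}(X_n), \\ X_n + \log\beta_{12} & \text{with prob } (1 - \pi_1)F_{X_n}(p_1) = \varphi_{12}(X_n), \\ X_n + \log\beta_{21} & \text{with prob } \pi_2(1 - F_{X_n}(p_1)) = \varphi_{21}(X_n), \\ X_n + \log\beta_{11} & \text{with prob } \pi_1 F_{X_n}(p_1) = \varphi_{11}(X_n). \end{cases}$ 

Also, 

(22)  $\lim_{x\to+\infty} \varphi_{22}(x) = 1 - \pi_2, \qquad \lim_{x\to-\infty} \varphi_{22}(x) = 0,$ 
      $\lim_{x\to+\infty} \varphi_{12}(x) = 0, \qquad \lim_{x\to-\infty} \varphi_{12}(x) = 1 - \pi_1,$ 
      $\lim_{x\to+\infty} \varphi_{21}(x) = \pi_2, \qquad \lim_{x\to-\infty} \varphi_{21}(x) = 0,$ 
      $\lim_{x\to+\infty} \varphi_{11}(x) = 0, \qquad \lim_{x\to-\infty} \varphi_{11}(x) = \pi_1 .$ 

Then 

(23)  $\mu_+ = \sum_{j,k} \log\beta_{jk} \lim_{x\to+\infty} \varphi_{jk}(x) = \pi_2\log\beta_{21} + (1 - \pi_2)\log\beta_{22},$ 

and 

(24)  $\mu_- = \sum_{j,k} \log\beta_{jk} \lim_{x\to-\infty} \varphi_{jk}(x) = \pi_1\log\beta_{11} + (1 - \pi_1)\log\beta_{12} .$ 

To apply Theorem 3 one also assumes that $\beta_{22}, \beta_{12} > 1 > \beta_{21}, \beta_{11} > 0$. 
On this assumption, and utilizing (23) and (24), we infer Theorem 5. 

THEOREM 5. For the contingent case of the four-operator model, with 
probability one 

(i) if $\mu_+ < 0$ and $\mu_- > 0$ then $\limsup p_n = 1$ and $\liminf p_n = 0$, 
(ii) if $\mu_+ < 0$ and $\mu_- < 0$ then $p_\infty = 1$, 
(iii) if $\mu_+ > 0$ and $\mu_- > 0$ then $p_\infty = 0$; 

and if $\mu_+ > 0$ and $\mu_- < 0$, then for some $\delta$ with $0 < \delta < 1$ 

$\Pr(p_\infty = 1) = \delta, \qquad \Pr(p_\infty = 0) = 1 - \delta.$ 

Specialization of this theorem to cover the noncontingent case is immediate. 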
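
The quantities $\mu_+$ and $\mu_-$ of (23) and (24) are simple to evaluate. The sketch below is an illustration with hypothetical parameter values and function names: it computes both drifts and applies the case distinctions of Theorem 5. Setting $\pi_1 = \pi_2$ gives the noncontingent specialization mentioned above.

```python
import math


def mu_plus(pi2, b21, b22):
    """Equation (23): limiting mean step as x -> +infinity (p near 0)."""
    return pi2 * math.log(b21) + (1.0 - pi2) * math.log(b22)


def mu_minus(pi1, b11, b12):
    """Equation (24): limiting mean step as x -> -infinity (p near 1)."""
    return pi1 * math.log(b11) + (1.0 - pi1) * math.log(b12)


def theorem5_case(pi1, pi2, b11, b21, b12, b22):
    """Case of Theorem 5, assuming b22, b12 > 1 > b21, b11 > 0."""
    mp = mu_plus(pi2, b21, b22)
    mm = mu_minus(pi1, b11, b12)
    if mp == 0.0 or mm == 0.0:
        return "boundary case, not covered by Theorem 5"
    if mp < 0 and mm > 0:
        return "(i) p_n oscillates between 0 and 1"
    if mp < 0 and mm < 0:
        return "(ii) p_n -> 1"
    if mp > 0 and mm > 0:
        return "(iii) p_n -> 0"
    return "(iv) p_n -> 1 or p_n -> 0, each with positive probability"


# Hypothetical operators satisfying beta_11 < beta_21 < 1 < beta_12 < beta_22.
betas = dict(b11=0.4, b21=0.6, b12=1.5, b22=2.5)
case_high = theorem5_case(0.8, 0.8, **betas)   # E_1 frequent after both responses
case_low = theorem5_case(0.2, 0.2, **betas)    # E_1 rare after both responses
case_mixed = theorem5_case(0.8, 0.2, **betas)  # each response tends to be confirmed
```

With these values `case_high` is case (ii), `case_low` is case (iii), and `case_mixed` falls under the final clause of the theorem, where both 0 and 1 are absorbing.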
CHAINS OF INFINITE ORDER AND THEIR 
APPLICATION TO LEARNING THEORY 


JOHN LAMPERTI AND PATRICK SUPPES 


1. Introduction. The purpose of this paper is to study the asymptotic 
behavior of a large class of stochastic processes which have been 
used as models of learning experiments. We will do this by applying 
a theory of so-called "chains of infinite order" or "chaînes à liaisons 
complètes." Namely, we shall employ certain limit theorems for stochastic 
processes whose transition probabilities depend on the entire past 
history of the process, but only slightly on the remote past. Such theorems 
were given by Doeblin and Fortet [3] in a form close to that we 
employ; however, in order to accommodate certain cases of learning models 
we found it necessary to relax somewhat their hypotheses. A self-contained 
discussion of these and some additional results is the content of §2. 

We should emphasize that this section is included to serve as preparation 
for the theorems of §4, and it is original with us only in some 
details and extensions. In addition to [3], papers by Harris [7] and 
Karlin [8] contain very closely related results and arguments, but not 
quite in the form we require. 

The processes which we shall study with these tools are called "linear 
learning models." From a psychological standpoint these models are 
very simple. A subject is presented a series of trials, and on each 
trial he makes a response, which consists of a choice from a finite set 
of possible actions. This response is followed by a reinforcement (again 
one of a finite number). The assumption of the model is that the subject's 
response probabilities on the next trial are linear functions of the 
probabilities on the present trial, where the form of the functions depends 
upon which reinforcement has occurred. Many results about such 
models may be found in Bush and Mosteller [2], Estes [4], and Estes and 
Suppes [6]. We will also study here models constructed along similar 
lines for experiments involving two or more subjects and a type of 
interaction between them ([6, Section 9] and Atkinson and Suppes [1]). 
Precise definitions of these processes are given below in §3. 

The references mentioned above do not, except in very special cases, 
give a thorough treatment of asymptotic properties. We shall prove 
that under general conditions linear learning models exhibit "ergodic" 
behavior; that is, that after much time has passed these processes become 
approximately stationary and the influence of the initial distributions 
goes to zero. This is not the case for all models which have been 
used in experimental work, but it seems as if ergodic behavior can be 
proved by our method in almost all the cases in which one might expect 
it. Our theorems to this effect, their proofs and some corollaries are 
given in §4. 


The major work so far on limiting behavior of learning models is 
Karlin [8], who obtains detailed limit theorems for certain classes of 
models. However, the results and even the techniques of Karlin's paper 
do not apply to many cases of interest. His starting point is a representation 
of the linear model as a Markov process whose states are the 
response probabilities. Two typical situations when such a representation 
is impractical arise (i) when the probabilities with which the reinforcement 
is selected depend on two or more previous responses, and (ii) 
in the many-person situations mentioned above. Both these situations 
can (and will) be studied using infinite order chains, and ergodic behavior 
established under mild restrictions. On the other hand, Karlin's work 
treats interesting non-ergodic cases outside the scope of our approach. For 
example, consider a T-maze experiment in which the subject (a rat, say) 
is reinforced (rewarded) on each trial regardless of whether he goes left 
or right. In the appropriate linear model, the probability of a left turn 
eventually is either nearly 0 or nearly 1, and which it is depends upon 
the rat's initial response probabilities. The model of this experiment 
has been thoroughly studied in [8, Section 2], and these results have 
been generalized by Kennedy [9]. 


In conclusion we comment that both more detailed results and other 
applications seem possible using the ideas of “infinite order chains.” 
We hope to contribute further to this development in the future. 


2. Chains of infinite order. In this section we present a theory 
of non-Markov stochastic processes where the transition probabilities are 
influenced only slightly by the remote past. The original convergence 
theorems for this type of process are due to Doeblin and Fortet [3]; 
they are given here in a generalized form (Theorems 2.1 and 2.2). The 
weaker hypotheses make the proof of Lemma 2.1 more complicated than 
it is in [3], but the other proofs are not much affected. T. E. Harris 
has also studied these chains; we shall not use his results but remark 
that his paper [7] gives additional references and background on the 
subject. Finally we point out that the restriction to a finite number of 
states is not essential, and the theorems can be extended to the 
denumerable case without much change of methods. 
Let $I$ consist of the integers from 1 to $N$ (to represent the states 
of the chain); we shall use the notation $x$ for a finite sequence $i_1, i_2, \cdots, i_m$ 
of integers from $I$. The subscript "m" on $x_m$ merely adds the specification 
that the sequence has $m$ terms; the "sum" $x_m + x'_n$ will be the 
combined sequence $i_1, \cdots, i_m, i'_1, \cdots, i'_n$. The starting point for the theory 
will be a set of functions $p_i(x)$ defined for all $i \in I$ and all sequences $x$ 
(including the sequence of length zero) and having the properties 

(2.1)  $p_i(x) \ge 0, \qquad \sum_{i \in I} p_i(x) = 1 .$ 


The function $p_i(x)$ will be interpreted as the conditional probability that 
a path function of the random process will go next to state $i$, having 
just occupied state $i_1$, previously $i_2$, etc. With this interpretation in 
mind we define inductively the "higher transition probabilities": 

(2.2)  $p_i^{(n+1)}(x) = \sum_{j} p_j(x)\, p_i^{(n)}(j + x),$ 

where of course $p_i^{(1)}(x) = p_i(x)$, the given function. It is easy to see that 
these higher probabilities also satisfy condition (2.1). The functions 
$p_i^{(n)}(x)$ are the analogues of the terms of the matrix $P^n$ for a Markov 
chain with transition matrix $P$; the theorems we shall give generalize 
the convergence properties of the matrices $P^n$. 
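
The recursion (2.2) can be evaluated directly for small $n$. The following sketch is illustrative only: the two-state transition function `p`, whose dependence on the past decays geometrically, is a hypothetical example and does not come from the paper, and histories are represented as tuples with the most recent state first.

```python
def p(i, x):
    """Hypothetical transition function on the states {1, 2}.

    x is the past history as a tuple, most recent state first. A state
    k steps back shifts the probability of state 1 by 0.3 * 2**-(k+1),
    so the remote past has only a slight influence.
    """
    s = sum(0.5 ** (k + 1) * (1.0 if state == 1 else -1.0)
            for k, state in enumerate(x))
    p1 = 0.5 + 0.3 * s          # always between 0.2 and 0.8
    return p1 if i == 1 else 1.0 - p1


def higher_prob(i, x, n, states=(1, 2)):
    """p_i^(n)(x) computed from the recursion (2.2)."""
    if n == 1:
        return p(i, x)
    return sum(p(j, x) * higher_prob(i, (j,) + x, n - 1, states)
               for j in states)


def gap(n):
    """Difference in p_1^(n) between two opposite starting histories."""
    return abs(higher_prob(1, (1, 1, 1), n) - higher_prob(1, (2, 2, 2), n))


total = higher_prob(1, (1, 2), 5) + higher_prob(2, (1, 2), 5)
gap_first, gap_later = gap(1), gap(8)
```

Here `total` equals one up to rounding, in line with the remark that the higher probabilities satisfy (2.1), and `gap_later` is smaller than `gap_first`, which is the kind of forgetting of the initial history that the theorems of this section make precise.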
We shall first impose a positivity condition on the transition 
probabilities; specifically we assume that for some state $j_0$, some positive 
integer $n_0$, and some $\delta > 0$, 

(2.3)  $p_{j_0}^{(n_0)}(x) > \delta$ for every $x$. 

We also need to make precise the "slight" dependence of these probabilities 
on the remote past; indeed, this is the crux of the whole theory. 
Define 

(2.4)  $\varepsilon_m = \sup |p_i(x + x') - p_i(x + x'')|,$ 

where the sup is taken over all states $i$, all sequences $x'$ and $x''$, and 
all sequences $x$ which contain the state $j_0$ at least $m$ times. We shall 
use the postulate 

(2.5)  $\sum_{m=0}^{\infty} \varepsilon_m < \infty .$ 

(In [3], $\varepsilon_m$ is defined in the same way except that the sup is taken over 
all $x$ of length at least $m$. Since this results in larger $\varepsilon_m$, and since 
it is also assumed there that $\sum \varepsilon_m < \infty$, our hypotheses are strictly 
weaker.) Throughout this section, (2.3) and (2.5) will be assumed. 


(In [3], s 


LEMMA 2.1. 

(2.6)  $\lim_{m\to\infty} \left[\sup |p_i^{(n)}(x + x') - p_i^{(n)}(x + x'')|\right] = 0,$ 

where the sup is the same as in (2.4) (i.e., $x$ contains $j_0$ at least $m$ 
times); the convergence is uniform in $n$. 
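Proof. We define quantities $\varepsilon_m^{(k)}$ by using $p_i^{(k)}$ instead of $p_i$ in (2.4); 
then of course $\varepsilon_m^{(1)} = \varepsilon_m$, and the conclusion of the lemma is equivalent 
to $\varepsilon_m^{(k)} \to 0$ uniformly in $k$ as $m \to \infty$. Now 

$|p_i^{(k)}(x + x') - p_i^{(k)}(x + x'')|$ 
$\quad = \left|\sum_j \left[p_j(x + x')\,p_i^{(k-1)}(j + x + x') - p_j(x + x'')\,p_i^{(k-1)}(j + x + x'')\right]\right|$ 
$\quad \le \sum_j p_j(x + x')\,\left|p_i^{(k-1)}(j + x + x') - p_i^{(k-1)}(j + x + x'')\right|$ 
$\qquad + \sum_j \left|p_j(x + x') - p_j(x + x'')\right|\,p_i^{(k-1)}(j + x + x'') .$ 

Suppose that $x$ contains $j_0$ $m$ times. Then the second term of the above 
estimate is less than $N\varepsilon_m$. The absolute value in the first term is less 
than $\varepsilon_m^{(k-1)}$, but if $j = j_0$ this can be improved to $\varepsilon_{m+1}^{(k-1)}$. Taking account 
of (2.3) and assuming that $n_0 = 1$, we obtain the estimate 

(2.7)  $\varepsilon_m^{(k)} \le N\varepsilon_m + \delta\,\varepsilon_{m+1}^{(k-1)} + (1 - \delta)\,\varepsilon_m^{(k-1)} .$ 

(In case $n_0 > 1$, the same idea can be carried out; the details are more 
cumbersome and will not be given.) 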


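Now (2.7) can be iterated to obtain an estimate of $\varepsilon_m^{(k)}$ in terms of 
the $\varepsilon_n$. After some computation the result is 

$\varepsilon_m^{(k)} \le N \sum_{i=0}^{k-1} \varepsilon_{m+i}\,\delta^{i} \sum_{l=i}^{k-1} \binom{l}{i}(1 - \delta)^{l-i} .$ 

If the series are extended to infinity, the inequality remains true; calling 
these (infinite) series $A_0, A_1, \cdots, A_{k-1}$, we have 

$\varepsilon_m^{(k)} \le N \sum_{i=0}^{k-1} \varepsilon_{m+i}\,\delta^{i} A_i .$ 

But it can be shown without much difficulty that 

$A_{i+1} - A_i = (1 - \delta)A_{i+1},$ 

or $A_{i+1} = A_i/\delta$. Since $A_0 = \delta^{-1}$ we obtain $A_i = \delta^{-(i+1)}$, and hence 

(2.8)  $\varepsilon_m^{(k)} \le \dfrac{N}{\delta} \sum_{i=0}^{k-1} \varepsilon_{m+i} .$ 

Recalling hypothesis (2.5), the uniform convergence of $\varepsilon_m^{(k)}$ follows from 
(2.8). 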
LEMMA 2.2. 

(2.9)  $\lim_{n\to\infty} |p_i^{(n)}(x') - p_i^{(n)}(x'')| = 0$ 

and the convergence is uniform in $x'$ and $x''$. 
Proof. For clarity we shall use probabilistic arguments, although a 
purely analytic rephrasing is not hard. Consider two stochastic processes 
operating independently with transition probabilities $p_i(x)$, one with the 
sequence $x'$ for its past history up to time 0 and the other with $x''$. 
In view of Lemma 2.1, for any $\varepsilon > 0$ there is an $m$ such that if the 
two processes have occupied the same states for a period which includes 
$j_0$ at least $m$ times and ends sometime before time $n$, then their 
probabilities of being in state $i$ at time $n$ differ by at most $\varepsilon/2$. But it 
follows from condition (2.3) that with probability one, there will sometime 
be a period of length $m$ during which both processes remain in state $j_0$. 
We can take $n$ large enough so that this simultaneous "run" of state 
$j_0$ will occur before time $n$ with probability not less than $1 - \varepsilon/2$. For 
this and all greater values of $n$, therefore, the two processes have 
probabilities of occupying state $i$ at time $n$ which differ by at most $\varepsilon$, and 
this proves (2.9). It is also easy to see from (2.3) and Lemma 2.1 that 
$n$ can be chosen uniformly in $x'$ and $x''$. 

With this much preparation we shall now prove the first theorem: 


THEOREM 2.1. The quantities 

(2.10)  $\lim_{n\to\infty} p_i^{(n)}(x) = \pi_i$ 

exist, are independent of $x$, and satisfy $\sum_i \pi_i = 1$; the convergence is 
uniform in $x$. 
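Proof. Applying (2.2) repeatedly, we have 

$p_i^{(n+m)}(x) = \sum_{x_m} p_{i_1}(x)\,p_{i_2}(i_1 + x) \cdots p_{i_m}(i_{m-1} + \cdots + i_1 + x)\,p_i^{(n)}(x_m + x),$ 

where $x_m = i_m, \cdots, i_1$. Therefore 

$|p_i^{(n+m)}(x) - p_i^{(n)}(x)| \le \sum_{x_m} p_{i_1}(x) \cdots p_{i_m}(i_{m-1} + \cdots + i_1 + x)\,|p_i^{(n)}(x_m + x) - p_i^{(n)}(x)|,$ 

and by Lemma 2.2, for any $\varepsilon$ there is an $n$ such that each term within 
absolute value signs on the right is less than $\varepsilon$. Since the weights 
$p_{i_1}(x)\,p_{i_2}(i_1 + x) \cdots p_{i_m}(i_{m-1} + \cdots + i_1 + x)$ sum to one, we have 

$|p_i^{(n+m)}(x) - p_i^{(n)}(x)| < \varepsilon,$ 

and so $p_i^{(n)}(x)$ has a (uniform in $x$) limit $\pi_i$. Since there are a finite 
number of states, 

$\sum_i \pi_i = \sum_i \lim_{n\to\infty} p_i^{(n)}(x) = \lim_{n\to\infty} \sum_i p_i^{(n)}(x) = 1,$ 

and this completes the proof. 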
Next we shall define joint probabilities. If $x_m = i_m, \cdots, i_1$, 

(2.11)  $p_{x_m}(x') = p_{i_1}(x')\,p_{i_2}(i_1 + x') \cdots p_{i_m}(i_{m-1} + \cdots + i_1 + x') .$ 

This is, of course, the probability of executing the sequence of states 
$x_m$ starting with past history $x'$. We can define also the higher joint 
probabilities: 

(2.12)  $p_{x_m}^{(n)}(x') = \sum_{j} p_j(x')\,p_{x_m}^{(n-1)}(j + x') .$ 
JE 


Analogues of Lemmas 2.1 and 2.2 can be proved for these quantities by 
the same arguments used already; in this way it is not difficult to prove 

THEOREM 2.2. The quantities 

(2.13)  $\lim_{n\to\infty} p_{x_m}^{(n)}(x') = \pi_{x_m}$ 

exist, are independent of $x'$, and satisfy $\sum_{x_m} \pi_{x_m} = 1$; the convergence 
is uniform in $x'$. 


REMARK. These two theorems imply the existence of a stationary 
stochastic process with the $p_i(x)$ for transition probabilities. The idea 
is that the quantities $\pi_{x_m}$ can be used to define a probability measure 
on the "cylinder sets" in the space of infinite sequences of members 
of $I$, and this measure can then be extended. This stationary process 
need not concern us further here. 

Finally we will prove convergence theorems for certain "moments" 
which are useful in studying experimental data. The idea is that if we 
have a stochastic process with the functions $p_i(x)$ for transition 
probabilities, the probability $p_i(x_m)$ that the state at time $m$ is $i$ given the 
past history $x_m$ is itself a random variable, and so it makes sense to 
study $E(p_i^{\nu}(x_m))$. More formally, define 

(2.14)  $\alpha_i^{\nu}(m, x) = \sum_{x_m} p_i^{\nu}(x_m + x)\,p_{x_m}(x),$ 

where $p_{x_m}(x)$ is defined by (2.11). Thus $\alpha_i^{1}(m, x)$ is the same as $p_i^{(m+1)}(x)$. 
Theorem 2.1 states that $\lim_{m\to\infty} \alpha_i^{1}(m, x) = \pi_i$ exists. 

We shall now prove 

THEOREM 2.3. The quantities 

(2.15)  $\lim_{m\to\infty} \alpha_i^{\nu}(m, x) = \alpha_i^{\nu}$ 

exist for every positive integer $\nu$; convergence is uniform in $x$ and the 
limit is independent of $x$. 
Proof. We use a simple estimate to show that $\alpha_i^{\nu}(m, x)$ is a Cauchy 
sequence: 


lan +k+ho)-amt+khk,ol 
= | NS Di(tuersn + De, (0) — SD Dian + OP; 


Tekh Thik 
SN Ditmas FF = DD SF ODDe sll) 
Tmek th - 
+ SID 0) = Pita + O)ID, (0) 
Tomsk 
+1 S Dit + DD, al) = D(a HPD el). 
Tmsk+h Tm+k 


If $m$ is chosen large enough, the first two terms will be arbitrarily 
small; this involves nothing more than the conditions (resulting from 
(2.3) and (2.5)) that $\varepsilon_m \to 0$, and that a long sequence contains $j_0$ many 
times with high probability. The last term may be rewritten by carrying 
out the summation over all the indices except those in $x_m$; this yields 
IS pi, + 2D (0)- DEOL S Sp") — pec) 


m 
z 
i) 


which is small for all $h$ (and for all $x$) if $k$ is large enough, by Theorem 
2.2. Thus if $n = m + k$, $|\alpha_i^{\nu}(n + h, x) - \alpha_i^{\nu}(n, x)|$ is small for all $h$, and 
this proves that the limit (2.15) must exist; the limit is uniform in $x$ 
since $\alpha_i^{\nu}(n, x)$ is uniformly Cauchy. Another estimate along much the 
same line can be made to show that for any $\varepsilon > 0$, 

$|\alpha_i^{\nu}(m + k, x) - \alpha_i^{\nu}(m + k, x')| \le \varepsilon$ 

provided $m$ and $k$ are large. Since the limit of $\alpha_i^{\nu}(m + k, x)$ exists as 
$m + k \to \infty$, we can conclude that the limit is the same for all $x$. 

It is also desirable to consider some additional "cross" moments 
involving $p_j(x)$ for several states at once; accordingly we define 

(2.16)  $\alpha_{j_1 \cdots j_k}^{\nu_1 \cdots \nu_k}(m, x) = \sum_{x_m} p_{j_1}^{\nu_1}(x_m + x)\,p_{j_2}^{\nu_2}(x_m + x) \cdots p_{j_k}^{\nu_k}(x_m + x)\,p_{x_m}(x) .$ 

The following theorem is then a generalization of Theorem 2.3, which 
treats the case $k = 1$: 

THEOREM 2.4. The quantities 

(2.17)  $\lim_{m\to\infty} \alpha_{j_1 \cdots j_k}^{\nu_1 \cdots \nu_k}(m, x) = \alpha_{j_1 \cdots j_k}^{\nu_1 \cdots \nu_k}$ 

exist uniformly in $x$ for all non-negative integers $\nu_1, \cdots, \nu_k$ and all 
$j_1, \cdots, j_k \in I$, and the limits are independent of $x$. 

The argument used in proving Theorem 2.3 works in this case also 
with only trivial changes, and need not be repeated. Finally we remark 
that moments involving several values of $n$ can be considered, and it 
can be shown that their limits exist also. This provides a generalization 
of Theorem 2.2. 


3. Definition of linear learning models. The models we consider 
apply to an experimental situation which consists of a sequence of trials. 
On each trial the subject of the experiment makes a response, which is 
followed by a reinforcing event. Thus an experiment may be represented 
by a sequence $(A_1, E_1, A_2, E_2, \cdots, A_n, E_n, \cdots)$ of random variables, where 
the choice of letters follows conventions established in the literature: 
the value of the random variable $A_n$ is a number $j$ representing the 
actual response on trial $n$, and the value of $E_n$ is a number $k$ representing 
the reinforcing event on trial $n$. The relevant data on each trial 
may then be represented by an ordered pair $(j, k)$ of integers with 
$1 \le j \le r$ and $0 \le k \le t$, that is, we envisage in general $r$ responses 
and $t + 1$ reinforcing events. Any sequence of these pairs of integers 
is a sequence of values of the random variables and thus represents a 
possible experimental outcome. The general aim of the theory is to 
predict the probability distribution of the response random variable when 
a particular distribution, or class of distributions, is imposed on the 
reinforcement random variable. 


In dealing with the general linear model with $r$ responses and
$t + 1$ reinforcing events we are following the formulation in Chapter 1
of Bush and Mosteller [2], although our notation is somewhat different,
being closer to Estes [4] and Estes and Suppes [6].

The theory is formulated for the probability of a response on trial
$n + 1$ given the entire preceding sequence of responses and reinforcements.
For this preceding sequence we use the notation $x_n$. Thus

$x_n = (k_n, j_n, k_{n-1}, j_{n-1}, \ldots, k_1, j_1)$.

(It is convenient to write these sequences in this order, but note that
the numbering here is from past to present, not the reverse as in §2.)
Our single axiom is the following linearity assumption:

Axiom L. If $E_n = k$ and $P(x_n) > 0$ then

(3.1)  $P(A_{n+1} = j \mid x_n) = (1 - \theta_k)\,P(A_n = j \mid x_{n-1}) + \theta_k \lambda_{jk}$,

where $0 \le \theta_k,\ \lambda_{jk} \le 1$ and $\sum_j \lambda_{jk} = 1$.


We obtain the linear model studied intensively in [6] by setting:

(3.2)  $\theta_k = \theta$ for $k \neq 0$,  $\theta_k = 0$ for $k = 0$,
       $\lambda_{jk} = 1$ for $k = j$,  $\lambda_{jk} = 0$ for $k \neq j$.

A linear model satisfying (3.2) we shall term an Estes Model, and for
such models (3.1) may be replaced by the simpler condition:

(3.3)  $P(A_{n+1} = j \mid x_n) = \begin{cases} (1-\theta)\,P(A_n = j \mid x_{n-1}) + \theta & \text{if } E_n = j, \\ (1-\theta)\,P(A_n = j \mid x_{n-1}) & \text{if } E_n = k,\ k \neq 0,\ k \neq j, \\ P(A_n = j \mid x_{n-1}) & \text{if } E_n = 0. \end{cases}$


Axiom L satisfies the combining classes condition of Bush and
Mosteller. Upon replacing $\theta_k$ by $1 - \alpha_k$ in (3.1) essentially their general
formulation of the linear model is obtained, although they do not explicitly
indicate dependence on the sequence $x_n$.
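As a concrete reading of Axiom L and its Estes specialization, the following minimal sketch applies one trial's linear operator to a vector of response probabilities. The function names, the zero-based indexing of responses, and the numerical values are assumptions made for illustration; they are not part of the original formulation.

```python
# Illustrative sketch of Axiom L (names and values are assumptions).
# Responses are indexed 0..r-1 here (the text uses 1..r); reinforcing
# events are indexed 0..t, with event 0 meaning "no reinforcement".

def linear_update(p, k, theta, lam):
    """Return the response probabilities on trial n+1 given those on trial n.

    p      -- list, p[j] = P(A_n = j | past)
    k      -- reinforcing event observed on trial n
    theta  -- list, theta[k] in [0, 1]
    lam    -- lam[j][k] in [0, 1], each column summing to 1
    """
    return [(1 - theta[k]) * p[j] + theta[k] * lam[j][k]
            for j in range(len(p))]


def estes_parameters(r, theta):
    """Parameters of the Estes model with r responses and events 0..r."""
    thetas = [0.0] + [theta] * r      # theta_0 = 0, theta_k = theta otherwise
    # Event k >= 1 reinforces response k-1 (zero-based), so lam = 1 there.
    lam = [[1.0 if k == j + 1 else 0.0 for k in range(r + 1)]
           for j in range(r)]
    # Column 0 is never used because theta_0 = 0; it is filled in only so
    # that every column sums to 1.
    lam[0][0] = 1.0
    return thetas, lam


thetas, lam = estes_parameters(r=2, theta=0.2)
p0 = [0.5, 0.5]
p_reinforced = linear_update(p0, 1, thetas, lam)    # event 1 reinforces response 0
p_unreinforced = linear_update(p0, 0, thetas, lam)  # event 0 leaves p unchanged
```

Because each column of `lam` sums to one, the update maps probability vectors to probability vectors; here the reinforced response moves from 0.5 toward 1 by the fraction theta.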

We also define here certain moments which are of experimental
interest and whose asymptotic properties we investigate subsequently.
The moments $\alpha_{j,\nu,n}$ of the response probabilities at trial $n$ are:

(3.4)  $\alpha_{j,\nu,n} = \sum_{x_{n-1}} P^{\nu}(A_n = j \mid x_{n-1})\, P(x_{n-1})$.

And if the appropriate limits exist, we define

(3.5)  $\alpha_{j,\nu} = \lim_{n \to \infty} \alpha_{j,\nu,n}$.

The moments (3.4) are formed in an unsymmetrical way; however,
they enter in a natural way in the expression of quantities which are
easily observed experimentally, for instance, the joint probability
$P(A_{n+1} = j, A_n = j')$. (For other examples, see [6].)

We are also interested in studying extensions of the linear model
to multiperson situations. We may suppose that we have $s$ subjects in
a situation such that the probability of a particular reinforcing event for
any one subject will depend in general on preceding responses and
reinforcements of the other $s - 1$ subjects as well as on his own prior
responses and reinforcements. The data on each trial may then be
represented by an ordered 2s-tuple $(j_1, k_1, \ldots, j_s, k_s)$ of integers with
$1 \le j_i \le r_i$, $0 \le k_i \le t_i$ for $i = 1, \ldots, s$, and any sequence of such tuples
represents a possible experimental outcome. Let $A_n^{(i)}$ and $E_n^{(i)}$ be the
response and reinforcement random variables for the ith subject on trial
$n$. We may then generalize Axiom L to:

Axiom M. For $1 \le i \le s$, if $E_n^{(i)} = k$ and $P(x_n) > 0$ then

(3.6)  $P(A_{n+1}^{(i)} = j \mid x_n) = (1 - \theta_k^{(i)})\,P(A_n^{(i)} = j \mid x_{n-1}) + \theta_k^{(i)} \lambda_{jk}^{(i)}$,

where $0 \le \theta_k^{(i)},\ \lambda_{jk}^{(i)} \le 1$ and $\sum_j \lambda_{jk}^{(i)} = 1$.
Experimental tests of Axiom M for two-person situations are reported
in Estes [5] and in Atkinson and Suppes [1]. Let $x_{n-1}^{(i)}$ be just the
sequence of first $n - 1$ responses and reinforcements of subject $i$. It is
a consequence¹ of Axiom M that

$P(A_n^{(i)} = j \mid x_{n-1}) = P(A_n^{(i)} = j \mid x_{n-1}^{(i)})$,

and it is in terms of $x_{n-1}^{(i)}$ that we define moments $\alpha_{j,\nu,n}^{(i)}$ exactly
analogous to (3.4). We shall also be interested in the joint moments

(3.7)  $\gamma_{j_1 \cdots j_s,\,n} = \sum_{x_{n-1}} P(A_n^{(1)} = j_1, \ldots, A_n^{(s)} = j_s \mid x_{n-1})\, P(x_{n-1})$,

and their asymptotes $\gamma_{j_1 \cdots j_s}$ if they exist. To work with these
latter moments in terms of Axiom M we need the additional reasonable
assumption that when all the $n - 1$ preceding responses and reinforcements
are given, the $s$ responses on trial $n$ are statistically independent:


Axiom I. If $P(x_{n-1}) > 0$ then

$P(A_n^{(1)} = j_1, \ldots, A_n^{(s)} = j_s \mid x_{n-1}) = \prod_{i=1}^{s} P(A_n^{(i)} = j_i \mid x_{n-1})$.

The experimental restriction implied by Axiom I has been satisfied in
the multiperson studies employing the linear model.


4. Asymptotic theorems for learning models. After dealing with
some matters of notation, we state general theorems on the existence
of asymptotic moments. The hypotheses of the theorems give some
broad conditions which guarantee ergodic behavior. We begin with the
one-person models satisfying Axiom L.

In this section it will be convenient to use some of the notation of
§2. Thus we may write $P(A_n = j \mid x_m + x')$ in place of $P(A_n = j \mid x_{n-1})$
to indicate we are interested in the last $m$ terms of $x_{n-1}$. The sum
$x_m + x'$ is just the combined sequence $x_{n-1}$. We reserve the subscript
$m$ for counting back $m$ trials from a given trial $n$.

To clarify the general theorem it is desirable to define in an exact
way the notion of the conditional probability of a reinforcing event
depending on only a finite number $m$ of past trial outcomes and
independent of the trial number.


DEFINITION. A linear model has a reinforcement schedule with past
dependence of length m if, and only if, for all $k$, $n$ and $n'$ with $n, n' > m$
and all $x_m$, $x'$ and $x''$

(4.1)  $P(E_n = k \mid x_m + x') = P(E_{n'} = k \mid x_m + x'')$.

(It is understood that $x_m$ includes the response $A_n$, which precedes $E_n$
on trial $n$.) It is to be noticed that the use of $n$ on one side and $n'$ on
the other side of (4.1) yields independence of trial number. The term
reinforcement schedule has been used because of its frequent occurrence
with approximately this meaning in the experimental literature. For the
conditional probabilities of (4.1) we shall use the notation

(4.2)  $\pi_{k, x_m} = P(E_n = k \mid x_m + x')$.

¹ Proof of this fact is analogous to that of Theorem 4.8 of [6].


We may now state the first general theorem. 


THEOREM 4.1. Let $\mathcal{L}$ be a linear model such that

(i) $\mathcal{L}$ has a reinforcement schedule with past dependence of
length $m^*$,

(ii) there is an integer $k^*$ such that

(a) $\theta_{k^*} \neq 0$,

(b) there is a $\delta^*$ and an $m_0$ such that for all sequences $x$ and all
integers $n$

$P(E_{n+m_0} = k^* \mid x) \ge \delta^* > 0$.

Then the asymptotic moments of $\mathcal{L}$ all exist and are independent
of the initial distribution of responses.


Proof. The central task is to characterize $\mathcal{L}$ as a chain of infinite
order and show that satisfaction of the hypotheses of the theorem implies
satisfaction of conditions (2.3) and (2.5). With this accomplished
the asymptotic theorems of §2 may be applied to $\mathcal{L}$. It is most convenient
to take as states of the chain the ordered pairs $(j, k)$, where $j$
is the response on trial $n$, say, and $k$ is the reinforcement on the preceding
trial. Consider now the reinforcement $k^*$ of the hypothesis of
the theorem. Let $j^*$ be a response such that $\lambda_{j^*k^*} \neq 0$. (There is at
least one such $j^*$ since $\sum_j \lambda_{jk^*} = 1$; in the Estes model $j^* = k^*$.) With
the pair $(j^*, k^*)$ as the state $j_0$ of the infinite order chain, we shall
establish (2.3) and (2.5).

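To verify (2.3), we use (ii)b of the hypothesis and the following
equalities and inequalities, which hold for all $x_n$ and $n$:

$$\begin{aligned}
P(A_{n+m_0+1} = j^*,\ & E_{n+m_0} = k^* \mid x_n) \\
&= \sum_{x_{m_0-1}} P(A_{n+m_0+1} = j^* \mid E_{n+m_0} = k^*,\ x_{m_0-1} + x_n)\, P(E_{n+m_0} = k^* \mid x_{m_0-1} + x_n)\, P(x_{m_0-1} \mid x_n).
\end{aligned}$$

Applying Axiom L, the right-hand side becomes:

$$\begin{aligned}
\sum_{x_{m_0-1}} & \left[(1 - \theta_{k^*})\,P(A_{n+m_0} = j^* \mid x_{m_0-1} + x_n) + \theta_{k^*}\lambda_{j^*k^*}\right] P(E_{n+m_0} = k^* \mid x_{m_0-1} + x_n)\, P(x_{m_0-1} \mid x_n) \\
&\ge \theta_{k^*}\lambda_{j^*k^*} \sum_{x_{m_0-1}} P(E_{n+m_0} = k^* \mid x_{m_0-1} + x_n)\, P(x_{m_0-1} \mid x_n) \\
&= \theta_{k^*}\lambda_{j^*k^*}\, P(E_{n+m_0} = k^* \mid x_n) \\
&\ge \theta_{k^*}\lambda_{j^*k^*}\, \delta^* \quad \text{by (ii)b.}
\end{aligned}$$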
To establish (2.5), consider the following equalities and inequalities:

(4.3)  $|P(A_{n+1} = j, E_n = k \mid x + x') - P(A_{n+1} = j, E_n = k \mid x + x'')|$
       $= \pi_{k, x_{m^*}}\, |P(A_{n+1} = j \mid E_n = k, x + x') - P(A_{n+1} = j \mid E_n = k, x + x'')|$,

where $x_{m^*}$ means the last $m^*$ terms of $x$, and where the sequence $x$
contains at least $m$ occurrences of $k^*$, with $m > m^*$. The equality
follows from (i) of the hypothesis, for by virtue of (i)

$\pi_{k, x_{m^*}} = P(E_n = k \mid x + x') = P(E_n = k \mid x + x'')$.

Applying Axiom L once to the right-hand side of (4.3) we get, ignoring
$\pi_{k, x_{m^*}}$,

$|P(A_{n+1} = j \mid E_n = k, x + x') - P(A_{n+1} = j \mid E_n = k, x + x'')|$
$= (1 - \theta_k)\,|P(A_n = j \mid x + x') - P(A_n = j \mid x + x'')|$.

We do not know that $\theta_k \neq 0$, but as we apply Axiom L repeatedly, we
obtain the factor $(1 - \theta_{k^*})$ at least $m$ times, so that

(4.4)  $|P(A_{n+1} = j, E_n = k \mid x + x') - P(A_{n+1} = j, E_n = k \mid x + x'')|$
       $\le (1 - \theta_{k^*})^m\, |P(A_{n-h} = j \mid x') - P(A_{n-h} = j \mid x'')|$,

where $h$ is the length of $x$. The difference term on the right of this
inequality is not more than 1, so that from (4.4) we obtain the estimate
for $m > m^*$

$\varepsilon_m \le (1 - \theta_{k^*})^m$,

whence

$\sum_m \varepsilon_m < \infty$,

which is (2.5).

On the basis of (2.3) and (2.5) we know from Theorem 2.4 that the
asymptotic cross-moments of $\mathcal{L}$ exist and are independent of the initial
distribution of responses. But

$P(A_{n+1} = j \mid x) = \sum_k P(A_{n+1} = j, E_n = k \mid x)$

and so the moments $\alpha_{j,\nu,n}$ can be expressed as sums of the cross-moments
for the infinite order chain, which insures the existence of the limiting
moments (3.5) and that they do not depend upon the initial distribution
of responses.

There are several remarks to be made about the theorem just
proved. First, we observe that a simple sufficient (but not necessary)
condition for (ii)b is

(4.5)  $\min_{x_{m^*}} \pi_{k^*, x_{m^*}} > 0$.

The interpretation of (4.5) is that the reinforcing event $k^*$ has positive
probability on every trial no matter what sequence $x_{m^*}$ of responses and
reinforcements preceded. A number of interesting experimental cases
of the linear model can be described in terms of (4.5), (i) and (ii)a of
Theorem 4.1.

(If all $\theta_k \neq 0$, the original condition given in [3] would be satisfied.
The weaker condition used here allows the inclusion of cases where some
of the $\theta_k$ are 0, i.e., where there can be trials without a reinforcement.)


I. Contingent case with lag v. In the Estes model let
$P(E_n = k \mid A_{n-v} = j) = \pi_{jk}(v)$ for all $n$ such that $P(A_{n-v} = j) > 0$. To satisfy
(4.5), we need only that for some $k$, $\pi_{jk}(v) \neq 0$ for all $j$. Experimental
data for v = 0, 1, 2 are given in Estes [5].


II. Double contingent case. Let

$P(E_n = k \mid A_n = j, A_{n-1} = j') = \pi_{jj'k}$

for all $n$ such that $P(A_n = j, A_{n-1} = j') > 0$.

Then (i) of Theorem 4.1 is immediately satisfied, and for (ii)a and
(4.5) we need a $k$ such that $\theta_k \neq 0$ and for all $j$ and $j'$, $\pi_{jj'k} \neq 0$.

An interesting fact about (I) and (II) is that although they are
simple to test experimentally and their asymptotic response moments
exist on the basis of Theorem 4.1, there is no known constructive method
for computing the actual asymptotes. (The Estes [5] test of (I) excludes
non-reinforced trials, which cause the computational difficulties.) It may
also be noted that the convergence theorems in Karlin [8] do not in
general apply to (II), and apply to (I) only if v = 0.

On the basis of the proof of Theorem 4.1 we may, by virtue of
Theorem 2.2, conclude that the asymptotic joint probabilities of successive
responses also exist:

COROLLARY 1. If the hypothesis of Theorem 4.1 is satisfied, then
for every $m$ the limit as $n \to \infty$ of

$P(A_n = j_0, A_{n+1} = j_1, \ldots, A_{n+m} = j_m)$

exists.

We may regard the quantities $P(A_n = j \mid x_{n-1})$, for $1 \le j \le r$, as a
random probability vector with an arbitrary joint distribution $F_1$ on trial
1, and distribution $F_n$ on trial $n$. The following corollary is a consequence
of the existence of the moments independent of the initial response
probabilities.

COROLLARY 2. If the hypothesis of Theorem 4.1 is satisfied, then
there is a unique asymptotic distribution $F_\infty$, independent of $F_1$, to which
the distributions $F_n$ converge.

For the multiperson situation characterized by Axioms I and M, we
have a theorem analogous to Theorem 4.1. For use in the hypothesis
of this theorem we define the notion of reinforcement schedule with
past dependence of length $m$, exactly as we did in (4.1), namely, we
have such a schedule if for all $k^{(i)}$, $1 \le i \le s$, all $n$ and $n'$ with $n, n' > m$
and all $x_m$, $x'$ and $x''$

$\pi_{k^{(1)} \cdots k^{(s)},\, x_m} = P(E_n^{(1)} = k^{(1)}, \ldots, E_n^{(s)} = k^{(s)} \mid x_m + x')$
$\qquad = P(E_{n'}^{(1)} = k^{(1)}, \ldots, E_{n'}^{(s)} = k^{(s)} \mid x_m + x'')$.

THEOREM 4.2. Let $\mathcal{M}$ be an s-person linear model such that

(i) $\mathcal{M}$ has a reinforcement schedule with past dependence of
length $m^*$,

(ii) there are integers $k^{(i)*}$, for $1 \le i \le s$, such that

(a) $\theta^{(i)}_{k^{(i)*}} \neq 0$,

(b) there is a $\delta^*$ and an $m_0$ such that for all sequences $x$ and
all integers $n$

$P(E_{n+m_0}^{(1)} = k^{(1)*}, \ldots, E_{n+m_0}^{(s)} = k^{(s)*} \mid x) \ge \delta^* > 0$.

Then the asymptotic moments $\alpha^{(i)}_{j,\nu}$, for $1 \le i \le s$, and $\gamma_{j_1 \cdots j_s}$ of $\mathcal{M}$ all exist and are
independent of the initial distribution of responses.
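Proof. The states of the chain are now defined as 2s-tuples
$(j^{(1)}, \ldots, j^{(s)}, k^{(1)}, \ldots, k^{(s)})$, where $j^{(i)}$ is the response made by the ith
subject and $k^{(i)}$ is the reinforcement for that subject on the preceding
trial. Using the reinforcements $k^{(i)*}$ of the hypothesis let $j^{(i)*}$ be such
that $\lambda^{(i)}_{j^{(i)*}k^{(i)*}} \neq 0$. We take $(j^{(1)*}, \ldots, j^{(s)*}, k^{(1)*}, \ldots, k^{(s)*})$ as the state $j_0$
for which we establish (2.3) and (2.5).

To simplify notation it is convenient to define:

$p_{n+1}(j, k \mid x) = P(A_{n+1}^{(1)} = j^{(1)}, \ldots, A_{n+1}^{(s)} = j^{(s)}, E_n^{(1)} = k^{(1)}, \ldots, E_n^{(s)} = k^{(s)} \mid x)$,

$p_{n+1}(j^{(i)} \mid k, x) = P(A_{n+1}^{(i)} = j^{(i)} \mid E_n^{(1)} = k^{(1)}, \ldots, E_n^{(s)} = k^{(s)}, x)$;

$\pi_{k, x_{m^*}} = \pi_{k^{(1)} \cdots k^{(s)},\, x_{m^*}}$.

Moreover, we omit the superscript notation from $\theta$ and $\lambda$.

To verify (2.3) we proceed exactly as in the proof of Theorem 4.1,
applying now Axioms I and M instead of L, and we obtain that

$p_{n+m_0+1}(j^*, k^* \mid x_n) \ge \delta^* \prod_{i=1}^{s} \theta_{k^{(i)*}} \lambda_{j^{(i)*}k^{(i)*}}$.

For (2.5), we first observe that by virtue of (i) of the hypothesis
and Axiom I

$|p_{n+1}(j, k \mid x + x') - p_{n+1}(j, k \mid x + x'')|$
$= \pi_{k, x_{m^*}} \left| \prod_{i=1}^{s} p_{n+1}(j^{(i)} \mid k, x + x') - \prod_{i=1}^{s} p_{n+1}(j^{(i)} \mid k, x + x'') \right|$.

We notice next that the right-hand side is

$\le \pi_{k, x_{m^*}} \Big[\, p_{n+1}(j^{(1)} \mid k, x + x') \left| \prod_{i=2}^{s} p_{n+1}(j^{(i)} \mid k, x + x') - \prod_{i=2}^{s} p_{n+1}(j^{(i)} \mid k, x + x'') \right|$
$\qquad + \prod_{i=2}^{s} p_{n+1}(j^{(i)} \mid k, x + x'')\, |p_{n+1}(j^{(1)} \mid k, x + x') - p_{n+1}(j^{(1)} \mid k, x + x'')| \,\Big]$.

Continuing this same development, we obtain:

$\le \sum_{i=1}^{s} |p_{n+1}(j^{(i)} \mid k, x + x') - p_{n+1}(j^{(i)} \mid k, x + x'')|$.

And by the line of reasoning used in the proof of Theorem 4.1, if the
sequence $x$ contains state $(j^{(1)*}, \ldots, k^{(s)*})$ at least $m$ times the last
quantity is

$\le \sum_{i=1}^{s} (1 - \theta_{k^{(i)*}})^m$.

Provided $m > m^*$ this inequality yields an estimate of $\varepsilon_m$, from which
we conclude that (2.5) holds. The existence of the asymptotic moments
then follows from the theory of §2 as in the case of Theorem 4.1. Q.E.D.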

A pair of corollaries follow from the theorem just proved which are 
exactly like the two given after Theorem 4.1. 

Finally, we want to remark that Axiom L involves linear functions 
which are distance diminishing, i.e., have slope less than one. The 
asymptotic results of this section apply to many learning models in 
which these linear functions are replaced by non-linear functions having 
this property. 
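The distance-diminishing property can be illustrated numerically. In the sketch below (two responses, so a single probability suffices; all parameter values are hypothetical) two subjects start from different response probabilities and receive the same reinforcement sequence. Each trial with a positive theta multiplies the gap between them by one minus that theta, while a non-reinforced trial leaves the gap unchanged.

```python
def run(p, events, theta, lam):
    """Apply p <- (1 - theta[k]) * p + theta[k] * lam[k] for each event k."""
    for k in events:
        p = (1 - theta[k]) * p + theta[k] * lam[k]
    return p


theta = {0: 0.0, 1: 0.3, 2: 0.3}   # event 0 = no reinforcement
lam = {0: 0.0, 1: 1.0, 2: 0.0}     # limit point of P(response 1) under event k
events = [1, 2, 0, 1, 1, 2, 1]     # six reinforced trials, one blank trial

# Same events, different starting probabilities 0.9 and 0.1.
gap = abs(run(0.9, events, theta, lam) - run(0.1, events, theta, lam))
predicted_gap = 0.8 * (1 - 0.3) ** 6   # initial gap times (1 - theta)^6
```

The computed gap agrees with the predicted geometric shrinkage, which is the mechanism behind the estimate used to establish (2.5).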


REFERENCES

1. Richard C. Atkinson and Patrick Suppes, An analysis of two-person game situations
in terms of statistical learning theory, J. of Experimental Psychology, 55 (1958), 369-378.

2. Robert R. Bush and Frederick Mosteller, Stochastic Models for Learning, New York,
1955.

3. W. Doeblin and R. Fortet, Sur des chaînes à liaisons complètes, Bull. Soc. Math.
France, 65 (1937), 132-148.

4. W. K. Estes, Theory of learning with constant, variable, or contingent probabilities of
reinforcement, Psychometrika, 22 (1957), 113-132.

5. W. K. Estes, Of models and men, Amer. Psychologist, 12 (1957), 609-617.

6. W. K. Estes and Patrick Suppes, Foundations of Statistical Learning Theory, I. The
Linear Model for Simple Learning, Technical Report No. 16, Contract Nonr [...],
Applied Mathematics and Statistics Laboratory, Stanford University, 1957. An abridged
version appears as Chapter 8 of Studies in Mathematical Learning Theory, edited by
R. R. Bush and W. K. Estes, Stanford University Press, 1959.

7. T. E. Harris, On chains of infinite order, Pacific J. Math., 5 (1955), 707-724.

8. Samuel Karlin, Some random walks arising in learning models I, Pacific J. Math., 3
(1953), 725-756.

9. Maurice Kennedy, A convergence theorem for a certain class of Markoff processes,
Pacific J. Math., 7 (1957), 1107-1124.


STANFORD UNIVERSITY

FINITE MARKOV PROCESSES IN PSYCHOLOGY* 
Finite Markov processes are reviewed and considered for their usefulness
in the description of behavioral data. The various alternative responses in 
an experimental situation define a vector space, and changes in the probabili- 
ties of these alternatives are represented by movements in this space. Meth- 
ods of fitting the theory to experimental data are considered. 

The simplest process, with a constant matrix of transitional probabilities 
that is applied repeatedly to represent the effect of successive trials, seems 
inadequate for most learning data. A matrix function that may be useful for 
learning theory is presented. 


In the two general areas where psychology has been relatively successful 
as a quantitative science, i.e., sensory psychology and test construction, 
probabilistic considerations long ago proved their worth. It is characteristic 
of these two areas, however, that the observations are relatively invariant 
in time. The basic parameters can be explored at length because sequential 
effects of measurement are secondary and can be ignored or randomized. 
This fortunate situation makes it possible to use familiar probability models 
based upon independent random variables. 

With the more dynamic problems of psychology, however, this familiar
model has not often led to profitable results. For example, it is intrinsic in
the very notion of learning that successive measurements are not independent;
attempts to use a theory of independent variables must either fail
or misrepresent the basic process. Such failures may lead to a rejection of
statistical concepts as inadequate; a more proper attitude is to abandon the
assumption of independence and ask what help can be had from dependent
probabilities. The simplest mathematical models incorporating dependent
probabilities are the finite Markov processes. In this paper such processes
are examined for their usefulness and their limitations for describing
psychological data.

1. Simple Markov Chains with Two Alternatives. The data from psychological
experiments usually come in the form of sequences of choices embedded
in the time continuum. Often it is possible to ignore the temporal
order in which alternative choices occur. The purpose of this discussion,
however, is to examine situations in which the temporal sequence should
not be ignored. We shall adopt the Markovian model of dependent probabilities
to discuss such sequences. We begin, therefore, with the simplest
possible example of a Markov chain.

*[...] for Advanced Study in Princeton, New [...]

Consider an experiment in which only two alternative responses are
possible. A trial consists of a choice of one of these two alternatives. If the
letters A and B designate these choices, then a sequence of trials might
produce the sequence of responses ABBAAABA ..., where the durations
and latencies are ignored. We shall assume that this sequence is produced
by a Markov process; i.e., that the distribution of probabilities at trial n + 1
depends upon the outcome of trial n. However, the knowledge of outcomes
prior to n does not change our description of the system if we know the
outcome of trial n. In other words, the present state of the system governs
its future development.

We adopt the following notation:


$n$               number of the trial: 0, 1, 2, ....
A and B           the two alternative responses.
$p^{(n)}(A)$      probability of alternative A at trial n.
$p^{(\infty)}(A)$ asymptotic value of $p^{(n)}(A)$ as $n \to \infty$.
$d_n$             the set of absolute probabilities at trial n, considered as
                  a vector; $[p^{(n)}(A), p^{(n)}(B)]$.
$p_A(B)$          given A at n, the conditional probability of B at n + 1.
$p_A^{(m)}(B)$    given A at n, the conditional probability of B at n + m,
                  m = 2, 3, ....
$T$               matrix of transitional probabilities.
$\lambda_i$       characteristic roots of the matrix T.


Alternative A can occur at trial n + 1 in either of two ways. Either it
follows an A on trial n, or it follows a B on trial n. Similarly, B can occur
at n + 1 in either of two ways. This obvious fact leads to the following
equations:

$p^{(n)}(A)\,p_A(A) + p^{(n)}(B)\,p_B(A) = p^{(n+1)}(A)$,
$p^{(n)}(A)\,p_A(B) + p^{(n)}(B)\,p_B(B) = p^{(n+1)}(B)$.   (1)

In matrix notation these equations can be written

$\begin{pmatrix} p_A(A) & p_B(A) \\ p_A(B) & p_B(B) \end{pmatrix} \begin{pmatrix} p^{(n)}(A) \\ p^{(n)}(B) \end{pmatrix} = \begin{pmatrix} p^{(n+1)}(A) \\ p^{(n+1)}(B) \end{pmatrix}$.   (2)

The reader is assumed to be familiar with the elements of matrix theory.
If the distribution of probabilities on trials n and n + 1 is regarded as the
vectors $d_n$ and $d_{n+1}$ in a two-dimensional space, then the square matrix of
transitional probabilities is a linear transformation or operator mapping $d_n$
into $d_{n+1}$. Thus we can write Eq. (2) as

$T d_n = d_{n+1}$.   (3)


Any sequence of distributions can be produced by operating upon the
successive $d_n$ by appropriate transformations. For the moment, however, we
shall consider a special case. We shall assume that repeated trials can be
represented as repeated transformations by the same operator. Thus we
can write for the initial trial:

$T d_0 = d_1$.

A second trial carries $d_1$ into $d_2$:

$T d_1 = d_2$.

In terms of $d_0$, therefore, we can write:

$T d_1 = T(T d_0) = T^2 d_0 = d_2$.

Or more generally,

$T^n d_0 = d_n$.   (4)


Since the probabilities of A and B on successive trials are given by
$T^n d_0$, we proceed to examine the powers of T. The elements of $T^n$ are $p_i^{(n)}(j)$,
where i = A, B; j = A, B. We wish to find a general expression for $T^n$ in
terms of $p_i(j)$ and n. From matrix theory we know that every square matrix
with distinct roots is similar* to a diagonal matrix whose diagonal elements
are the characteristic roots $\lambda_i$ of T. We designate this similar diagonal matrix
by $\Lambda$, and write

$\Lambda = S^{-1} T S$,

where S is a matrix whose columns are the characteristic vectors of T. From
this we obtain

$T = S \Lambda S^{-1}$.

To obtain the powers of T we note that

$T^2 = S \Lambda S^{-1} S \Lambda S^{-1} = S \Lambda^2 S^{-1}$,

or more generally,

$T^n = S \Lambda^n S^{-1}$.   (5)

Powers of $\Lambda$ are simply calculated, for since $\Lambda$ is a diagonal matrix, its powers
are given by the powers of the diagonal elements $\lambda_i$.

To find $\Lambda$ for the matrix of Eq. (2) we first write the characteristic
equation for the matrix T. If we use the fact that $p_A(A) + p_A(B) = 1$ (and


*Two matrices are said to be similar when they have the same characteristic roots.
similarly for B subscripts), the determinantal equation can be written in the
convenient form

$\det(T - \lambda I) = \lambda^2 - [p_A(A) + p_B(B)]\lambda + [p_A(A) - p_B(A)] = 0$.

The roots of this equation are the characteristic roots of the matrix:

$\lambda_1 = 1$ and $\lambda_2 = p_A(A) - p_B(A)$.

Since the sums of all the columns of T are unity, we note that unity is always
a root of these matrices. Substituting these roots into $T v_i = \lambda_i v_i$ and solving
for the characteristic vectors, $v_i$, we obtain the vectors $[1, p_A(B)/p_B(A)]$
and $[1, -1]$. These vectors comprise the columns of S, and so from Eq.
(5) we obtain, after inverting S,


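$T^n = \begin{pmatrix} 1 & 1 \\ \dfrac{p_A(B)}{p_B(A)} & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & [p_A(A) - p_B(A)]^n \end{pmatrix} \dfrac{1}{p_A(B) + p_B(A)} \begin{pmatrix} p_B(A) & p_B(A) \\ p_A(B) & -p_B(A) \end{pmatrix}$.   (6)

Eq. (6) can be written more conveniently

$T^n = \dfrac{1}{p_A(B) + p_B(A)} \begin{pmatrix} p_B(A) & p_B(A) \\ p_A(B) & p_A(B) \end{pmatrix} + \dfrac{[p_A(A) - p_B(A)]^n}{p_A(B) + p_B(A)} \begin{pmatrix} p_A(B) & -p_B(A) \\ -p_A(B) & p_B(A) \end{pmatrix}$.   (7)

Since $|p_A(A) - p_B(A)| < 1$, the second term on the right of Eq. (7) goes
to zero as $n \to \infty$, so the first term represents the asymptotic form of $T^n$.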
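The closed form for the powers of T can be checked against direct matrix multiplication. The sketch below is only a numerical check under assumed transition probabilities (0.8 and 0.4 are arbitrary illustrative values, and the helper names are not from the text); it writes T with columns summing to one, and splits the n-th power into a constant asymptotic part plus a part that decays like the n-th power of the second characteristic root.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(T, n):
    """T raised to the non-negative integer power n by repeated multiplication."""
    size = len(T)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, T)
    return result


def closed_form_power(p_AA, p_BA, n):
    """T^n for the two-alternative chain: asymptotic part plus decaying part."""
    p_AB = 1 - p_AA              # p_A(B)
    root = p_AA - p_BA           # second characteristic root
    s = p_AB + p_BA
    return [[(p_BA + root**n * p_AB) / s, (p_BA - root**n * p_BA) / s],
            [(p_AB - root**n * p_AB) / s, (p_AB + root**n * p_BA) / s]]


p_AA, p_BA = 0.8, 0.4            # hypothetical values of p_A(A) and p_B(A)
T = [[p_AA, p_BA], [1 - p_AA, 1 - p_BA]]
direct = mat_pow(T, 6)
formula = closed_form_power(p_AA, p_BA, 6)
max_error = max(abs(direct[i][j] - formula[i][j])
                for i in range(2) for j in range(2))
```

The two results agree to rounding error, and each column of the closed form still sums to one, as it must for a matrix of transitional probabilities.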


With Eq. (7) we can calculate $T^n d_0$, and so obtain the probability of
A on successive trials:

$p^{(n)}(A) = \dfrac{p_B(A)}{p_A(B) + p_B(A)} + [p_A(A) - p_B(A)]^n\, \dfrac{p^{(0)}(A)\,p_A(B) - p^{(0)}(B)\,p_B(A)}{p_A(B) + p_B(A)}$.   (8)

The value of $p^{(\infty)}(A)$ is $\dfrac{p_B(A)}{p_A(B) + p_B(A)}$.


It is apparent that Eq. (8) can be written

$p^{(n)}(A) = a(1 - b e^{-cn})$,   (9)

where

$a = \dfrac{p_B(A)}{p_A(B) + p_B(A)}$,

$b = p^{(0)}(B) - p^{(0)}(A)\,\dfrac{p_A(B)}{p_B(A)}$,

$c = -\ln\,[p_A(A) - p_B(A)]$.


Eq. (9) is an exponential growth function—a form frequently used to de- 
scribe data from learning experiments. It should be noted, however, that 
while the average subject may follow such a learning function, the individual 
subjects are generating stationary time series that do not represent learning. 
The term “learning” probably should be reserved for those cases in which 
the matrix operator changes on successive trials. 

We shall illustrate the use of the Markov chain with a numerical example.
Suppose that two alternative responses are called right (R) and
wrong (W), that $p^{(n)}(R)$ and $p^{(n)}(W)$ are measured by the percentage of
subjects in a large sample that choose R and W on trial n, and that the
transitional probabilities observed on successive pairs of trials are constant.
Assume the following numerical values for $T d_0 = d_1$:

$\begin{pmatrix} .97 & .27 \\ .03 & .73 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} .27 \\ .73 \end{pmatrix}$.

A right response is followed by another right response 97 per cent of the
time; wrong follows wrong 73 per cent of the time. From Eq. (8) we calculate
that the successive values of $p^{(n)}(R)$ are 0, .27, .46, .59, .68, etc., approaching
the asymptote of .90. The equation is

$p^{(n)}(R) = .90\,[1 - (.70)^n]$.

If we know that on a particular trial a W occurred, this equation gives the
probability of R on the nth succeeding trial.
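The numbers in this example can be reproduced directly from the two transition probabilities quoted in the text (.97 for right after right, .73 for wrong after wrong). The short sketch below, with illustrative variable names, evaluates the closed-form expression for the probability of a right response, starting from a wrong response on trial 0, and also computes the asymptote and the exponential rate constant.

```python
import math

p_RR = 0.97            # p_R(R): right follows right
p_WW = 0.73            # p_W(W): wrong follows wrong
p_RW = 1 - p_RR        # p_R(W)
p_WR = 1 - p_WW        # p_W(R)

root = p_RR - p_WR                 # second characteristic root
asymptote = p_WR / (p_RW + p_WR)   # limiting probability of R
rate = -math.log(root)             # constant c in a(1 - b e^{-cn})


def p_right(n, p0_right=0.0):
    """Probability of R on trial n when R has probability p0_right on trial 0."""
    transient = (p0_right * p_RW - (1 - p0_right) * p_WR) / (p_RW + p_WR)
    return asymptote + root**n * transient


values = [round(p_right(n), 2) for n in range(5)]
```

Rounded to two places, `values` agrees with the sequence 0, .27, .46, .59, .68 given in the text, and `asymptote` with the stated limit of .90.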


2. Autocorrelation Function. A simple parameter of such Markov chains
is the autocorrelation function. We will mention it now because for the more
complex cases we wish to consider next the autocorrelation function is either
not defined or is most tedious to compute from the matrix of transitional
probabilities.

The autocorrelation function is the correlation of a time series with itself
displaced 0, 1, 2, ... steps. With zero displacement the correlation of the
series with itself is, of course, +1. With a displacement of one step, the
responses on trials 1, 2, 3, ... are correlated with the responses on trials
2, 3, 4, .... If the series of binary choices is fairly long, the autocorrelation
after a displacement of one step is given by

$r_1 = p_A(A) - p_B(A)$.   (10)

We note that $r_1$ is a characteristic root of the matrix of transitional
probabilities. More generally,

$r_m = p_A^{(m)}(A) - p_B^{(m)}(A)$,   (11)

where $p_A^{(m)}(A)$ and $p_B^{(m)}(A)$ are elements of $T^m$. From Eq. (7) we observe
that these elements of $T^m$ are

$p_A^{(m)}(A) = \dfrac{p_B(A) + p_A(B)\,[p_A(A) - p_B(A)]^m}{p_A(B) + p_B(A)}$

and

$p_B^{(m)}(A) = \dfrac{p_B(A) - p_B(A)\,[p_A(A) - p_B(A)]^m}{p_A(B) + p_B(A)}$.

When these values are substituted in Eq. (11), we obtain

$r_m = [p_A(A) - p_B(A)]^m = r_1^m$.   (12)

In short, for a simple Markov chain, the autocorrelation between positions
n and n + m is the mth power of the autocorrelation between n and n + 1.
If $|r_1| < 1$, then $|r_m|$ declines monotonically toward zero.

A simple example is provided by the Samoan language. E. B. Newman
has noted that the sequence of consonants (C) and vowels (V) in Samoan
writing is adequately described as a Markov chain with the following matrix
of transitional probabilities:

$\begin{pmatrix} p_C(C) & p_V(C) \\ p_C(V) & p_V(V) \end{pmatrix} = \begin{pmatrix} 0 & .49 \\ 1 & .51 \end{pmatrix}$.

Consonants never follow consonants in written Samoan. The autocorrelation
function is easily computed from this matrix. For successive displacements
of one letter the value of the correlation coefficient is 1, −.49, .24, −.12,
.06, −.03, etc.
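A quick numerical check of the Samoan example, using the entries implied by the text (consonants never follow consonants, and the one-step correlation is −.49); helper names are illustrative. The sketch computes the autocorrelation both as the difference of the first-row elements of successive powers of the matrix and as the m-th power of the one-step value.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Columns are the present letter (C, V); rows are the next letter (C, V).
T = [[0.0, 0.49],
     [1.0, 0.51]]

r1 = T[0][0] - T[0][1]            # one-step autocorrelation, p_C(C) - p_V(C)

from_powers = []                  # r_m read off the elements of T^m
Tm = [[1.0, 0.0], [0.0, 1.0]]     # T^0
for m in range(6):
    from_powers.append(Tm[0][0] - Tm[0][1])
    Tm = mat_mul(Tm, T)

from_root = [r1**m for m in range(6)]
rounded = [round(r, 2) for r in from_root]
```

Both routes give the same values, and `rounded` reproduces the sequence 1, −.49, .24, −.12, .06, −.03 quoted above.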

The autocorrelation function for this simple process can also be described
as the determinant of $T^m$. Thus $r_0$ is the determinant of $T^0 = I$, $r_1$
is the determinant of T, $r_2$ is the determinant of $T^2$, etc.

When the distribution of probabilities at n [...] prior to n as well as upon n itself, Eq. [...]

3. Extension to More than Two Alternatives. The extension of the matrix
equations to experiments involving more than two alternative responses is
straightforward. Designate the alternatives A, B, ..., N. Then we have

$\begin{pmatrix} p_A(A) & p_B(A) & \cdots & p_N(A) \\ p_A(B) & p_B(B) & \cdots & p_N(B) \\ \vdots & \vdots & & \vdots \\ p_A(N) & p_B(N) & \cdots & p_N(N) \end{pmatrix} \begin{pmatrix} p^{(n)}(A) \\ p^{(n)}(B) \\ \vdots \\ p^{(n)}(N) \end{pmatrix} = \begin{pmatrix} p^{(n+1)}(A) \\ p^{(n+1)}(B) \\ \vdots \\ p^{(n+1)}(N) \end{pmatrix}$.   (13)

General solutions are known for certain types of operators. These are of 
considerable interest in physics and genetics, where the elements of T are 
given by theory. The present use of such operators is almost purely de- 
scriptive, however, for we do not know what special types of matrices will 
be of the greatest psychological interest. 

It is not always necessary to find a general solution. A qualitative
understanding of an experimental situation is often provided by simply
transforming the initial distribution five or ten steps by direct matrix
multiplication. For example, a learning situation might be analyzed into three
kinds of responses: correct (C), slightly wrong (S), and grossly wrong (G).
During the course of learning a subject begins by making gross mistakes,
then slight mistakes, and finally manages to make correct responses. Such a
situation could produce a matrix equation like the following:


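$T d_0 = \begin{pmatrix} p_C(C) & p_S(C) & p_G(C) \\ p_C(S) & p_S(S) & p_G(S) \\ p_C(G) & p_S(G) & p_G(G) \end{pmatrix} \begin{pmatrix} p^{(0)}(C) \\ p^{(0)}(S) \\ p^{(0)}(G) \end{pmatrix} = \begin{pmatrix} .9 & .3 & 0 \\ .1 & .6 & .3 \\ 0 & .1 & .7 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ .3 \\ .7 \end{pmatrix}$.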

It is tedious to find the general solution of $T^n$, and it is easy to see by direct
multiplication what happens. The proportion of grossly wrong responses
declines steadily: 1, .7, .52, .40, .32, .26, ..., .08. The proportion of small
errors on successive trials at first increases, then decreases: 0, .3, .39, .40,
.38, .35, ..., .23. The proportion of correct responses gives a roughly
S-shaped function: 0, 0, .09, .20, .30, .38, .45, ..., .69. This situation is
analogous to pouring water from one vessel into a second, which in turn pours
the water into a third. The asymptotic distribution can always be found by
solving the equation $T d_\infty = d_\infty$.
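Direct multiplication of this kind is easy to sketch in code. The matrix entries below are an assumption: they are the values that reproduce the three sequences quoted in the text when every subject starts with a grossly wrong response. The helper name `step` is illustrative.

```python
# Rows and columns are ordered C, S, G; each column sums to one.
T = [[0.9, 0.3, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.1, 0.7]]


def step(T, d):
    """One trial: return the distribution T d."""
    return [sum(T[i][j] * d[j] for j in range(len(d)))
            for i in range(len(T))]


history = [[0.0, 0.0, 1.0]]          # d_0: every subject starts grossly wrong
for _ in range(200):
    history.append(step(T, history[-1]))

early = [[round(x, 2) for x in d] for d in history[:6]]
limit = [round(x, 2) for x in history[-1]]
```

Under these assumed entries the first few distributions match the quoted proportions trial by trial, and after many trials the distribution settles near (.69, .23, .08), the solution of the fixed-point equation for the asymptotic distribution.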
The form of a general solution can be indicated, for finite matrices with
distinct roots, as follows. Let $\lambda_i$ represent the N characteristic roots of the
polynomial $\det(T - \lambda I)$. We define a set of matrices $f_i(T)$ by

$f_i(T) = \dfrac{(T - \lambda_1 I)(T - \lambda_2 I) \cdots (T - \lambda_{i-1} I)(T - \lambda_{i+1} I) \cdots (T - \lambda_N I)}{(\lambda_i - \lambda_1)(\lambda_i - \lambda_2) \cdots (\lambda_i - \lambda_{i-1})(\lambda_i - \lambda_{i+1}) \cdots (\lambda_i - \lambda_N)}$.   (14)

In terms of these matrices, T can be expressed

$T = \lambda_1 f_1(T) + \lambda_2 f_2(T) + \cdots + \lambda_N f_N(T)$.   (15)

If $g(\lambda)$ is a rational scalar polynomial, then

$g(T) = g(\lambda_1) f_1(T) + g(\lambda_2) f_2(T) + \cdots + g(\lambda_N) f_N(T)$.   (16)

In particular, if $g(\lambda) = \lambda^n$, we have

$T^n = \lambda_1^n f_1(T) + \lambda_2^n f_2(T) + \cdots + \lambda_N^n f_N(T)$.   (17)

The 2 × 2 transformation is expressed in this form in Eq. (7). Concerning
the roots $\lambda_i$, we know that $\lambda_1$ can be assigned the value 1, and that all the
other roots fall between −1 and +1. Thus the asymptotic value of $T^n$ is
given by $f_1(T)$.
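The expansion in component matrices can be sketched for any matrix whose characteristic roots are known and distinct. The code below is a minimal illustration, not a general eigenvalue solver: the roots are supplied by the caller, and the 2 × 2 right/wrong matrix with first row (.97, .27), whose roots are 1 and .70, serves as the test case. Function names are assumptions.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def component_matrices(T, roots):
    """Return f_i(T) = prod over j != i of (T - root_j I) / (root_i - root_j)."""
    n = len(T)
    components = []
    for i, root_i in enumerate(roots):
        f = identity(n)
        for j, root_j in enumerate(roots):
            if j == i:
                continue
            factor = [[(T[a][b] - (root_j if a == b else 0.0))
                       / (root_i - root_j)
                       for b in range(n)] for a in range(n)]
            f = mat_mul(f, factor)
        components.append(f)
    return components


def power_from_roots(T, roots, n):
    """T^n as the sum of root_i**n * f_i(T)."""
    size = len(T)
    total = [[0.0] * size for _ in range(size)]
    for root, f in zip(roots, component_matrices(T, roots)):
        for a in range(size):
            for b in range(size):
                total[a][b] += root**n * f[a][b]
    return total


T = [[0.97, 0.27],
     [0.03, 0.73]]
roots = [1.0, 0.97 - 0.27]
f1 = component_matrices(T, roots)[0]      # asymptotic value of T^n
T4 = power_from_roots(T, roots, 4)
T4_direct = mat_mul(mat_mul(T, T), mat_mul(T, T))
```

For this matrix the component belonging to the root 1 has .9 in its first row and .1 in its second, which is the asymptotic form of the powers. As the text warns, the divisions by differences of roots make the method numerically delicate when two roots nearly coincide.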

The solution for a particular matrix can always be obtained by (a) finding
the roots of the characteristic polynomial, $\det(T - \lambda I)$; (b) determining the
$f_i(T)$ according to Eq. (14); (c) substituting into Eq. (17); and (d) solving
$T^n d_0$ for the given boundary conditions of $d_0$. This procedure has the
advantage of avoiding the problem of inverting a large matrix, but if two or
more roots are nearly the same, the computations may be quite difficult.

The autocorrelation function is not defined for more than two unordered
alternatives, because the value of the correlation coefficient varies according
to the various possible assignments of numerical values to the different
alternatives. However, the determinant of the matrix of transitional
probabilities has many of the characteristics of a correlation coefficient, and in
the 2 × 2 case the determinant and the autocorrelation coefficient are
identical. The determinant of $T^n$, as a function of n, lies between +1 and −1,
declines toward 0 for the Markov processes, and can reveal periodicities in
much the same way as an autocorrelation function. The possible usefulness
of this extension to N × N transformations needs to be explored.


4. Extension to Compound Responses. [...] an inconvenience that
Markov [...] memory. We must now remove the restriction that, if trial n
is known, events prior to n are irrelevant for predicting n + 1. We [...]
consider the non-Markovian case. What we must do is to expand the definition
of a state of the system in order to make such systems Markovian in a
larger space.

If the probabilities at trial n + 1 depend upon the outcomes of trials n
and n − 1, but knowledge of events prior to n − 1 does not change our
prediction for n + 1, we have a non-Markovian system. This system is made
to be Markovian by changing the definition of an event. Instead of
characterizing the state of the system by the occurrence of a single response,
we characterize it by pairs of responses. If there are two atomic alternatives,
A and B, in the original system, then there are four compound alternatives,
AA, AB, BA, and BB, in the new system. Thus we must define a distribution
$d_n$ over four alternatives, and T is a square matrix of fourth order:


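$$Td_n = \begin{bmatrix}
p_{AA}(AA) & 0 & p_{BA}(AA) & 0 \\
p_{AA}(AB) & 0 & p_{BA}(AB) & 0 \\
0 & p_{AB}(BA) & 0 & p_{BB}(BA) \\
0 & p_{AB}(BB) & 0 & p_{BB}(BB)
\end{bmatrix}
\begin{bmatrix} p^{(n)}(AA) \\ p^{(n)}(AB) \\ p^{(n)}(BA) \\ p^{(n)}(BB) \end{bmatrix}
= \begin{bmatrix} p^{(n+1)}(AA) \\ p^{(n+1)}(AB) \\ p^{(n+1)}(BA) \\ p^{(n+1)}(BB) \end{bmatrix}. \tag{18}$$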


Note that many of the transitional probabilities are zero; it is not possible 
for the system to move from some state to others in a single step. For ex- 
ample, the system cannot move from AA to BB in less than two steps: 
AA → AB → BB, as in the sequence AABB.
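
The construction of the compound-state matrix can be sketched as follows. The conditional probabilities used here are hypothetical values chosen only to illustrate the zero pattern of Eq. (18); the helper names are likewise illustrative.

```python
STATES = ["AA", "AB", "BA", "BB"]


def compound_matrix(p_next_a):
    """Build the 4 x 4 matrix over pairs of responses.

    p_next_a[s] is the (hypothetical) probability that A follows the pair s.
    Column = current pair, row = next pair; a pair xy can only be followed
    by a pair that begins with y, which produces the zeros of Eq. (18).
    """
    t = [[0.0] * 4 for _ in range(4)]
    for col, s in enumerate(STATES):
        last = s[1]
        t[STATES.index(last + "A")][col] = p_next_a[s]
        t[STATES.index(last + "B")][col] = 1.0 - p_next_a[s]
    return t


def step(t, d):
    """Apply the transformation once: d_{n+1} = T d_n."""
    return [sum(t[i][j] * d[j] for j in range(4)) for i in range(4)]


# Hypothetical second-order dependencies.
T = compound_matrix({"AA": 0.3, "AB": 0.6, "BA": 0.5, "BB": 0.8})
d = [0.25, 0.25, 0.25, 0.25]
for _ in range(500):
    d = step(T, d)  # iterate toward the stable distribution
```

In the stable distribution the pairs AB and BA are equally probable, since every change from A to B must be matched by a change back.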

Tabulations of sequences of vowels and consonants in written Hebrew
have been made by E. B. Newman. The sequence of consonants (A) and
vowels (B) can be adequately represented by a matrix of the form of Eq. (18):


0 0.28" 0). 098 
L 0° OL AIO 
0 81 0 .90| |.410 
0 19 0 10) .085, 


As before, the transformation T can be applied iteratively to carry any 
initial distribution into a final, unique, stable distribution. 

This extension of the Markov process can be carried as far as the data 
seem to merit. For example, fixed-ratio reinforcement in operant conditioning 
requires an animal to respond m times in one way, then approach the food 
tray. In order to keep track of the sequential aspects of this behavior we 
could define a state of the system to include all the possible sequences of 
responses and approaches of length m + 1. Thus there would be $2^{m+1}$ alternative
states, and the transformation would be of order $2^{m+1}$. More complex
sequential dependencies arise in human verbal behavior and can be treated 
in a similar manner. The verbal case is so complex, however, that it cannot 
be adequately discussed in this paper. 

In principle it is possible to extend the Markov definition indefinitely
to take into account as much of the past history of the system as one desires.
Cases are known, however, in which the extension would need to be carried
infinitely far into the past in order for the Markov model to summarize all
the information. Such cases are better handled in other ways. At present, it
seems likely that most learning situations will need to be described by these
other methods, and that Markov processes using a single matrix of transitional
probabilities are most valuable when the behavior has settled into a relatively
stable pattern.


5. Least-Squares Fit to Data. Under the assumption that a single
transformation describes the behavior, every trial can be considered a measurement
of the single transformation T. We wish to find a least-squares solution that
will give the best estimate for T from the available data. The following
procedures may not be the most efficient for Markov processes, but they represent
one fairly natural extension of the procedures used with more familiar
statistical problems.


We introduce a matrix M to represent the observed data. This matrix
is formed by placing in successive columns the distributions observed on
successive trials, from trial 1 through trial n − 1. If each distribution
contains a alternative quantities, and n such distributions are known for
successive trials, then M is an a × (n − 1) matrix. A matrix N is formed
analogously by placing in successive columns the distributions observed on
the successive trials from 2 through n. Thus N is also an a × (n − 1) matrix.
The matrix $\hat{N}$ represents the best estimate of the successive distributions:

$$\hat{N} = N + C, \tag{19}$$

where the elements of the matrix C are the corrections that must be added
to the observed values in N to give $\hat{N}$.

We wish to determine $\hat{T}$, the best estimate of the transformation. From
the definition of M and N and the assumption of a single operator throughout
learning, we have the equation:

$$\hat{T}M = \hat{N} = N + C. \tag{20}$$

From Eq. (20) we obtain an expression for C:

$$C = -N + \hat{T}M. \tag{21}$$

For a least-squares solution, CC' must be a minimum. This is obtained by
putting the partial derivative with respect to $\hat{T}$ to zero:

$$\frac{\partial}{\partial \hat{T}}\, CC' = MC' = 0. \tag{22}$$

We now substitute for C' from Eq. (21) into Eq. (22) and obtain

$$M(-N + \hat{T}M)' = -MN' + MM'\hat{T}' = 0.$$

Rearranging terms gives

$$\hat{T}' = (MM')^{-1} MN',$$

or

$$\hat{T} = NM'(MM')^{-1}. \tag{23}$$

Eq. (23) provides a best estimate of T on the basis of the data matrices M
and N.

As an example, consider an experiment in a T-maze. We decide from
an examination of the data that the learning process can be described by a
Markov process with a single transformation. Suppose that 10 rats were run
for 20 trials, and that on successive trials the following numbers of rats
made the correct choice: 5, 7, 6, 6, 8, 8, 8, 7, 8, 9, 8, 7, 8, 9, 10, 10, 8, 8, 9, 9.
From these data we construct the matrices:


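$$M = \begin{bmatrix}
.5 & .7 & .6 & .6 & .8 & .8 & .8 & .7 & .8 & .9 & .8 & .7 & .8 & .9 & 1.0 & 1.0 & .8 & .8 & .9 \\
.5 & .3 & .4 & .4 & .2 & .2 & .2 & .3 & .2 & .1 & .2 & .3 & .2 & .1 & 0 & 0 & .2 & .2 & .1
\end{bmatrix}$$

$$N = \begin{bmatrix}
.7 & .6 & .6 & .8 & .8 & .8 & .7 & .8 & .9 & .8 & .7 & .8 & .9 & 1.0 & 1.0 & .8 & .8 & .9 & .9 \\
.3 & .4 & .4 & .2 & .2 & .2 & .3 & .2 & .1 & .2 & .3 & .2 & .1 & 0 & 0 & .2 & .2 & .1 & .1
\end{bmatrix}$$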


Next we multiply these matrices to obtain

$$NM' = \begin{bmatrix} 12.16 & 3.14 \\ 2.74 & .96 \end{bmatrix}, \qquad
MM' = \begin{bmatrix} 11.99 & 2.91 \\ 2.91 & 1.19 \end{bmatrix}.$$

The matrix MM' is easily inverted, and we have

$$\hat{T} = NM'(MM')^{-1}
= \begin{bmatrix} 12.16 & 3.14 \\ 2.74 & .96 \end{bmatrix}
\begin{bmatrix} 1.19 & -2.91 \\ -2.91 & 11.99 \end{bmatrix} \frac{1}{5.80}
= \begin{bmatrix} .92 & .39 \\ .08 & .61 \end{bmatrix}.$$

The initial distribution $d_0$ is (.5, .5); and from Eq. (8) we obtain

$$p^{(n)}(R) = .83 - .33(.53)^n.$$

The values calculated from this equation are .500, .665, .738, .785, .804, ...,
approaching .83 as the asymptote. Note that we do not have a least-squares
fit of this function, $p^{(n)}(R)$, to the observed data; we have a least-squares
fit for the transformation T.
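
The T-maze computation can be reproduced with a short script. This is a sketch under the assumptions of Eq. (23) only; the helper functions are written out by hand for the 2 × 2 case and their names are illustrative.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


# Number of rats (out of 10) making the correct choice on trials 1..20.
correct = [5, 7, 6, 6, 8, 8, 8, 7, 8, 9, 8, 7, 8, 9, 10, 10, 8, 8, 9, 9]
props = [c / 10 for c in correct]

# M holds the distributions on trials 1..19, N those on trials 2..20.
M = [props[:-1], [1 - p for p in props[:-1]]]
N = [props[1:], [1 - p for p in props[1:]]]

NMt = matmul(N, transpose(M))          # approx. [[12.16, 3.14], [2.74, 0.96]]
MMt = matmul(M, transpose(M))          # approx. [[11.99, 2.91], [2.91, 1.19]]
T_hat = matmul(NMt, inverse_2x2(MMt))  # Eq. (23)
```

Rounded to two decimals, `T_hat` agrees with the matrix obtained in the text, and each of its columns sums to 1.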


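From Eq. (21) we can calculate the corrections that are added to N:

$$\hat{T}M = \begin{bmatrix}
.655 & .761 & .708 & .708 & .814 & .814 & .814 & .761 & .814 & .867 & .814 & .761 & .814 & .867 & .920 & .920 & .814 & .814 & .867 \\
.345 & .239 & .292 & .292 & .186 & .186 & .186 & .239 & .186 & .133 & .186 & .239 & .186 & .133 & .080 & .080 & .186 & .186 & .133
\end{bmatrix}$$

$$C = \begin{bmatrix}
-.045 & .161 & .108 & -.092 & .014 & .014 & .114 & -.039 & -.086 & .067 & .114 & -.039 & -.086 & -.133 & -.080 & .120 & .014 & -.086 & -.033 \\
.045 & -.161 & -.108 & .092 & -.014 & -.014 & -.114 & .039 & .086 & -.067 & -.114 & .039 & .086 & .133 & .080 & -.120 & -.014 & .086 & .033
\end{bmatrix}$$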


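The squared deviations are given by

$$CC' = \begin{bmatrix} .144 & -.144 \\ -.144 & .144 \end{bmatrix}.$$

The best estimate of the dispersion of the calculated from the observed
values is

$$s^2 = \frac{.144}{17} = .0085, \qquad s = .092. \tag{24}$$

The variance-covariance matrix V is given by

$$V = s^2 (MM')^{-1} = \frac{s^2}{5.80} \begin{bmatrix} 1.19 & -2.91 \\ -2.91 & 11.99 \end{bmatrix}. \tag{25}$$

From Eq. (25) we compute the standard deviations of the estimates of
$p_A(A)$ and $p_B(B)$:

$$\sigma[p_A(A)] = .092 \sqrt{\frac{1.19}{5.80}} = .04,$$

$$\sigma[p_B(B)] = .092 \sqrt{\frac{11.99}{5.80}} = .132.$$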
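
The dispersion and the two standard deviations can be checked numerically. The sketch below starts from the rounded transformation of the example; the divisor 17 (the 19 columns of M less the 2 alternatives) is an assumption chosen to match the value s = .092 used in the text.

```python
import math

correct = [5, 7, 6, 6, 8, 8, 8, 7, 8, 9, 8, 7, 8, 9, 10, 10, 8, 8, 9, 9]
props = [c / 10 for c in correct]
m_row, n_row = props[:-1], props[1:]   # first rows of M and N

T = [[0.92, 0.39], [0.08, 0.61]]       # rounded estimate from the example

# First row of the predicted distributions, and the sum of squared corrections.
predicted = [T[0][0] * p + T[0][1] * (1 - p) for p in m_row]
c = sum((f - o) ** 2 for f, o in zip(predicted, n_row))

dof = len(m_row) - 2                   # assumed divisor: 19 - 2 = 17
s = math.sqrt(c / dof)

# Elements of MM' and its determinant.
mm11 = sum(p * p for p in m_row)
mm22 = sum((1 - p) ** 2 for p in m_row)
mm12 = sum(p * (1 - p) for p in m_row)
det = mm11 * mm22 - mm12 ** 2

sigma_a = s * math.sqrt(mm22 / det)    # standard deviation of p_A(A)
sigma_b = s * math.sqrt(mm11 / det)    # standard deviation of p_B(B)
```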
The same procedure can be applied to the data from a single animal.
The data matrices M and N then have either 0 or 1 on successive trials; e.g.,
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NE ST ei 
0A O00 LE OOO 00 
i ES 

LADO OL LILSO cL 00 000 


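In order to solve for T we determine

$$NM' = \begin{bmatrix} m(1,1) & m(0,1) \\ m(1,0) & m(0,0) \end{bmatrix}, \qquad
MM' = \begin{bmatrix} m(1) & 0 \\ 0 & m(0) \end{bmatrix}.$$

The symbol m(i,j) represents the number of occurrences of the ordered pair
i,j; m(i) represents the number of occurrences of i; and m(0) + m(1) =
n − 1, where n is the number of trials. Next we invert MM' and solve for T:

$$\hat{T} = NM'(MM')^{-1}
= \begin{bmatrix} m(1,1) & m(0,1) \\ m(1,0) & m(0,0) \end{bmatrix}
\begin{bmatrix} \dfrac{1}{m(1)} & 0 \\ 0 & \dfrac{1}{m(0)} \end{bmatrix}
= \begin{bmatrix} \dfrac{m(1,1)}{m(1)} & \dfrac{m(0,1)}{m(0)} \\[2ex] \dfrac{m(1,0)}{m(1)} & \dfrac{m(0,0)}{m(0)} \end{bmatrix}. \tag{26}$$

Eq. (26) is the result that would be expected from the definition of the
transitional probabilities.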
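
For a single animal, Eq. (26) amounts to counting ordered pairs. A minimal sketch follows; the response sequence is hypothetical and the function name is illustrative.

```python
def estimate_from_sequence(seq):
    """Estimate T from one animal's 0/1 responses by counting ordered pairs.

    m[(i, j)] is the number of times response i is followed by response j.
    Both responses must occur among the first n - 1 trials, otherwise a
    column of the estimate is undefined (division by zero).
    """
    m = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, b in zip(seq, seq[1:]):
        m[(a, b)] += 1
    m1 = m[(1, 1)] + m[(1, 0)]
    m0 = m[(0, 1)] + m[(0, 0)]
    return [[m[(1, 1)] / m1, m[(0, 1)] / m0],
            [m[(1, 0)] / m1, m[(0, 0)] / m0]]


# Hypothetical record of eleven trials (1 = correct, 0 = incorrect).
responses = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
T_single = estimate_from_sequence(responses)
```

Here the pair counts are m(1,1) = 3, m(1,0) = 3, m(0,1) = 3, and m(0,0) = 1, so the first column of the estimate is (.5, .5) and the second is (.75, .25).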
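In order to estimate the dispersion we calculate

$$\hat{T}M = \begin{bmatrix}
\dfrac{m(1,1)}{m(1)} & \dfrac{m(0,1)}{m(0)} & \dfrac{m(1,1)}{m(1)} & \cdots & \dfrac{m(1,1)}{m(1)} \\[2ex]
\dfrac{m(1,0)}{m(1)} & \dfrac{m(0,0)}{m(0)} & \dfrac{m(1,0)}{m(1)} & \cdots & \dfrac{m(1,0)}{m(1)}
\end{bmatrix}.$$

Then from Eq. (21) we find

$$C = -N + \hat{T}M = \begin{bmatrix}
\dfrac{m(1,1)}{m(1)} & -\dfrac{m(0,0)}{m(0)} & -\dfrac{m(1,0)}{m(1)} & \cdots & -\dfrac{m(1,0)}{m(1)} \\[2ex]
-\dfrac{m(1,1)}{m(1)} & \dfrac{m(0,0)}{m(0)} & \dfrac{m(1,0)}{m(1)} & \cdots & \dfrac{m(1,0)}{m(1)}
\end{bmatrix}.$$

The squared deviations are given by

$$CC' = \begin{bmatrix} c & -c \\ -c & c \end{bmatrix},$$

where

$$\begin{aligned}
c &= m(1,0)\left[\frac{m(1,1)}{m(1)}\right]^2 + m(1,1)\left[\frac{m(1,0)}{m(1)}\right]^2
 + m(0,1)\left[\frac{m(0,0)}{m(0)}\right]^2 + m(0,0)\left[\frac{m(0,1)}{m(0)}\right]^2 \\
&= \frac{m(1,1)\,m(1,0)}{m(1)} + \frac{m(0,1)\,m(0,0)}{m(0)}.
\end{aligned}$$

The dispersion is, therefore,

$$s^2 = \frac{c}{n-3} = \frac{1}{n-3}\left\{ m(1)\left[\frac{m(1,1)}{m(1)}\right]\left[\frac{m(1,0)}{m(1)}\right]
 + m(0)\left[\frac{m(0,1)}{m(0)}\right]\left[\frac{m(0,0)}{m(0)}\right] \right\}. \tag{27}$$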

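The variance-covariance matrix is

$$V = s^2 (MM')^{-1} = s^2 \begin{bmatrix} \dfrac{1}{m(1)} & 0 \\ 0 & \dfrac{1}{m(0)} \end{bmatrix},$$

and from this matrix we compute

$$\sigma[p_1(1)] = \frac{s}{\sqrt{m(1)}} \quad \text{and} \quad \sigma[p_0(0)] = \frac{s}{\sqrt{m(0)}}. \tag{28}$$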


the Markov process becomes more widely applied. 

6. Variable Transformations. Up to this point we have made the
explicit assumption that a single transformation could describe the successive
changes in the probabilities of the alternative responses or alternative
sequences of responses. This assumption greatly simplifies the theoretical
landscape and should be made whenever the data hint that it might be true.
Simplicity is not, however, an intrinsic property of the behavior of living
organisms, and so we must be prepared to deal with situations that obviously
violate the assumption.
The assumption that a single transformation is adequate means that the
transitional probabilities are fixed from the first through the last trial. Since
the transitional probabilities determine the sequences of responses that are
probable or improbable, we are assuming that the animal's course of action
or strategy is fixed throughout the experiment. In a certain sense, therefore,
such an assumption means that there is no learning at all; as soon as the
experimental situation is encountered for the first time, the subject adopts
the set of transitional probabilities that will later describe the statistical
properties of his behavior after he has had long experience in the situation.

The assumption of a single transformation would be justified, for ex- 
ample, after a long series of alternate conditioning and extinction. In this 
experiment the subject is able to evolve a single transformation for the re- 
inforcement conditions and another for the extinction conditions. Or if an 
animal has adopted a stable mode of behavior in a situation and then is
temporarily distracted in some way, his return to normal when the
impediment is removed might be expected to follow a single transformation.
But in most of the situations that are studied experimentally there is no a
priori reason to expect that a single transformation will be adequate, and
there are several reasons to expect that it will not be. 

In order to illustrate what is involved in the assumption of a single 
transformation, Table I has been prepared to show one case where the 
assumption is correct and another where the assumption is wrong. Once more 
we consider the data from 10 rats on 20 consecutive choices in a T-maze. 
The symbol 1 represents correct choice, and 0 represents an incorrect choice. 
In Tables IA and IB the numbers of rats making the correct choice are the
same, and both are the same as the example fitted in the preceding section.


TABLE I
Hypothetical Data for Ten Rats on Twenty Trials in a T-Maze


IA. Constant Transformation 


Trial 
Rat 12845 67 8910 112181415 16 17 18 19 20 
1 11000 00011 LET LL LAT 
2 00000 LLL TITLLLY 1I0LlLA 
8 tfEEEL {IIT ff LLL 11000 
4 Oo11ll 11111 00001 LLL 
5 LILIILE TEI EL ELLY 10011 
6 OoO11l1l III1IL1 LILLIE L fFEALLELL 
7 o0001 LIAIES JELLY IIIT 
Ri LALIT o0000 0 0I1ITL iI 
9 o0o001 121111 LOGI LILITH 
10 LLL 11001 LLALISR 11111 


TABLE I (Continued)


IB. Variable Transformation 
Trial 
Rat 


= 
0) 
2) 
= 
[5 


B78 G0 11 12 13 14 15 16 17 18 19 20 
11 E01. EO A TA Oe NLD ESE! 
12 0: 0:6 0° LE i OT 0 TAHA LOT 31 
IB dra US eA TL. 00 LOO 
IE ILO i Ue 0G NET 000 LORS ES CE 
IB L104 2 ME Eo De ITD 100371 
16 BL 0 LT E00 Le i HE ee A EL) 
17 (Ee EH s iE SEG EE 01 La HE 
18 BE Lt 00 Hk YE le 10 GM Le A 
19 LL 00:04 QOL tee LIL 
20 OL RSD 10 OR Le EE DLE LL LO 
ড় 57 06-8 8:8 7 5-0 87 HOI 1088-8 


From the data in Table I we can estimate the values of $p_1(1)$ and $p_0(0)$
on successive pairs of trials by m(i, i)/m(i):


IA Trial   p_1(1)  p_0(0)      IB Trial   p_1(1)  p_0(0)
1-2        1.00    0.60        1-2        0.60    0.20
2-3        0.86    1.00        2-3        0.72    0.67
3-4        1.00    1.00        3-4        0.60    0.50
4-5        1.00    0.25        4-5        0.83    0.25
5-6        0.88    0.50        5-6        0.75    0.00
6-7        1.00    1.00        6-7        0.88    0.50
7-8        0.88    1.00        7-8        0.75    0.50
8-9        1.00    0.67        8-9        0.72    0.00
9-10       1.00    0.50        9-10       0.88    0.50
10-11      0.89    1.00        10-11      0.78    0.00
11-12      0.88    1.00        11-12      0.75    0.50
12-13      1.00    0.67        12-13      0.86    0.33
13-14      1.00    0.50        13-14      1.00    0.50
14-15      1.00    0.00        14-15      1.00    0.00
15-16      1.00    --          15-16      1.00    --
16-17      0.80    --          16-17      0.80    --
17-18      0.88    0.50        17-18      0.88    0.50
18-19      1.00    0.50        18-19      1.00    0.50
19-20      1.00    1.00        19-20      1.00    1.00


There seems to be a clear trend in IB for $p_1(1)$ to increase on successive
trials, whereas no trend for $p_1(1)$ is observable in IA. If we group the trials
by fives to secure more reliable estimates, we get

IA Trials  p_1(1)  p_0(0)      IB Trials  p_1(1)  p_0(0)
1-6        0.94    0.67        1-6        0.72    0.33
6-11       0.95    0.89        6-11       0.85    0.30
11-16      0.98    0.63        11-16      0.98    0.38
16-20      0.92    0.60        16-20      0.92    0.60
Comparisons such as these show that the assumption of a constant
transformation cannot be checked by the successive distributions alone, for 
IA and IB are identical in this respect. The assumption is justified if the 
analysis of short sequences of trials shows relatively constant transitional 
frequencies, as in IA. If the transitional frequencies show a definite trend, 
as in IB, the assumption is not justified. 
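
The check on short sequences of trials can be sketched as follows. The records below are hypothetical (four rats, three trials) and serve only to show the counting; the function name is illustrative.

```python
def transitional_frequencies(records, t):
    """Estimate p_1(1) and p_0(0) from the pair of trials (t, t + 1).

    records[r][t] is 1 if rat r was correct on trial t (0-based), else 0.
    An estimate is None when no rat gave the conditioning response on trial t.
    """
    m = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for rat in records:
        m[(rat[t], rat[t + 1])] += 1
    m1 = m[(1, 1)] + m[(1, 0)]
    m0 = m[(0, 0)] + m[(0, 1)]
    p11 = m[(1, 1)] / m1 if m1 else None
    p00 = m[(0, 0)] / m0 if m0 else None
    return p11, p00


# Hypothetical records: rows are rats, columns are successive trials.
records = [
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
]
first_pair = transitional_frequencies(records, 0)
second_pair = transitional_frequencies(records, 1)
```

Both pairs of trials show two correct rats followed by more correct rats, yet the transitional frequencies differ, (.5, .5) against (1.0, 0.0); a drift of this kind is what distinguishes IB from IA.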

The question is what to do when we face variable transformations.
Whatever we do, the situation will not be simple. If $\ldots PQRST\,d_0$ cannot
be translated into $\ldots TTTTT\,d_0$, the matrix products may get quite
complex. If we could choose P, Q, R, S, T as commutative matrices, it would be
possible to find a simultaneous solution for all of them; all matrices would
have the same characteristic vectors but different characteristic roots.
Unfortunately, however, it does not seem possible in general to choose
commutative matrices with the properties demanded by the data.

If the complexity of the problem is admitted as inevitable, we can still
look for a matrix function of n, T(n), that changes in some reasonable way
on successive trials. The following argument illustrates one possible approach.
We assume that at the beginning of the experiment the subjects are equipped
with transitional preferences given by the matrix U. After long experience
in the situation the subjects develop transitional preferences given by the
matrix V. As the experiment progresses the tendencies represented by U are
slowly extinguished and those represented by V are slowly strengthened.
Consider the following sequence of equations:

$$\begin{aligned}
T(0) &= U \\
T(1) &= wT(0) + (1 - w)V \\
T(2) &= wT(1) + (1 - w)V \\
&\;\;\vdots \\
T(n) &= wT(n - 1) + (1 - w)V,
\end{aligned} \tag{29}$$

where 0 < w < 1. The rationale for this set of equations is that w represents
the perseveration of the tendencies on the preceding trial, and (1 − w)
represents the ability to adopt the new mode of response symbolized by V.
If the extinction of the old pattern of responses is slow, w is near unity; if
the old pattern extinguishes rapidly, w is near zero.

Eq. (29) can be written in terms of U and V:

$$\begin{aligned}
T(0) &= U = (U - V) + V \\
T(1) &= wU + (1 - w)V = w(U - V) + V \\
T(2) &= w^2 U + (1 - w^2)V = w^2(U - V) + V \\
&\;\;\vdots \\
T(n) &= w^n U + (1 - w^n)V = w^n(U - V) + V.
\end{aligned} \tag{30}$$
In this form it is clear that, since 0 < w < 1, T(n) approaches V as n
increases. The importance of U becomes progressively smaller as the subject
has more and more experience in the experimental situation. This formulation 
has the advantage that it is relatively easy to compute the successive values 
of T(n), given U and V. The initial and final matrices, U and V, can be given 
theoretically or can be determined from data obtained prior to the first trial 
and after the learned behavior has stabilized again in the new course of action. 
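For illustrative purposes, assume that U and V are known to be

$$U = \begin{bmatrix} .5 & .5 \\ .5 & .5 \end{bmatrix} \quad \text{and} \quad
V = \begin{bmatrix} .9 & .4 \\ .1 & .6 \end{bmatrix},$$

and that the weight w is calculated to be 0.8. Then Eq. (30) gives

$$T(n) = (.8)^n \begin{bmatrix} -.4 & .1 \\ .4 & -.1 \end{bmatrix} + \begin{bmatrix} .9 & .4 \\ .1 & .6 \end{bmatrix}.$$

Then on successive learning trials we have:

n:        0    1    2     3     4     5     6     7     8     9     10    ...
p_A(A):   .5   .58  .644  .695  .736  .768  .796  .816  .832  .846  .857  ...
p_B(B):   .5   .52  .536  .549  .559  .567  .574  .579  .583  .587  .589  ...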


Next we calculate the proportions of right and wrong responses on successive
trials. This is given by the equation:

$$\begin{aligned}
T(0)d_0 &= d_1 \\
T(1)d_1 &= d_2 = T(1)T(0)d_0 \\
T(2)d_2 &= d_3 = T(2)T(1)T(0)d_0.
\end{aligned} \tag{31}$$

It is assumed that T(0) = U and $d_0$ are known from preliminary
experimentation. Assume the boundary condition $d_0$ = (.5, .5). Then direct
computation gives the values:

t:      1    2    3     4     5     6     7     8     9     10    ...  ∞
P(R):   .5   .53  .559  .587  .614  .639  .662  .683  .700  .716  ...  .800

Considerable care must be taken with such iterated computation, for the
errors are cumulative.
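
The iterated computation of Eqs. (30) and (31) is easy to mechanize. The sketch below uses the U, V, and w of the illustration and carries full floating-point precision rather than three decimals, so late values may differ from the hand computation in the last place; the function names are illustrative.

```python
U = [[0.5, 0.5], [0.5, 0.5]]
V = [[0.9, 0.4], [0.1, 0.6]]
W = 0.8


def t_of(n):
    """Eq. (30): T(n) = w^n (U - V) + V."""
    return [[W ** n * (U[i][j] - V[i][j]) + V[i][j] for j in range(2)]
            for i in range(2)]


def p_right(trials, d0=(0.5, 0.5)):
    """Eq. (31): apply T(0), T(1), ... in turn and record P(R) on each trial."""
    d = list(d0)
    out = []
    for n in range(trials):
        t = t_of(n)
        d = [t[0][0] * d[0] + t[0][1] * d[1],
             t[1][0] * d[0] + t[1][1] * d[1]]
        out.append(d[0])
    return out


values = p_right(10)   # P(R) on trials 1 through 10
limit = p_right(200)[-1]
```

The limit is the stable distribution of V alone, .4/(.4 + .1) = .8, in agreement with the last entry of the table.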


It should be noted that if w = 0, the variable case reduces to the constant
case, for then T(n) = V and $\prod T(i) = T^n$. Similarly, if w = 1, then T(n) = U
and we again have a single transformation.

A special case arises if U and V commute, UV = VU, for then T(n)
and T(n + k) also commute. If two matrices with distinct roots commute,
then one can be written as a polynomial in terms of the other, with scalar
coefficients. Thus if the matrices A and B commute, we can write, according
to Eq. (15) and (16),

$$\begin{aligned}
B &= \lambda_1 f_1(B) + \lambda_2 f_2(B) + \cdots + \lambda_N f_N(B) \\
A &= g(B) = g(\lambda_1) f_1(B) + g(\lambda_2) f_2(B) + \cdots + g(\lambda_N) f_N(B),
\end{aligned} \tag{32}$$

where $\lambda_i$ is the characteristic root of B; $g(\lambda_i)$ is the characteristic root of A;
and for matrices of transitional probabilities $\lambda_1 = g(\lambda_1) = 1$. Thus A and
B have different roots, but $f_i(A) = f_i(B)$. Another way of saying the same
thing is to note that commutative matrices are transformed into their diagonal
form by the same operator. Thus if S transforms A into the diagonal form
$\Lambda_A$, S also transforms B into its diagonal form $\Lambda_B$. The product of A and
B is (since the diagonal matrices $\Lambda_A$ and $\Lambda_B$ obviously commute)

$$AB = (S\Lambda_A S^{-1})(S\Lambda_B S^{-1}) = S\Lambda_A\Lambda_B S^{-1} = S\Lambda_B\Lambda_A S^{-1}
= (S\Lambda_B S^{-1})(S\Lambda_A S^{-1}) = BA.$$

If the matrices T(i) commute, then

$$\prod_i T(i) = S\left[\prod_i \Lambda(i)\right] S^{-1}, \tag{33}$$

where the $\Lambda(i)$ are the diagonal matrices similar to T(i). The product of the
T(i) reduces to the product of diagonal matrices. If all of the $\Lambda(i)$'s are
equal, then Eq. (33) reduces to the constant case given by Eq. (5).
Commutative matrices occur when the distribution over the several
alternative responses does not change, although the transitional probabilities
do change. If U has been applied repeatedly, $U^n$ approaches $f_1(U)$ as a limit;
after V has been applied repeatedly, $V^n$ approaches $f_1(V)$. When U and V
commute, $f_1(U) = f_1(V)$, and so both transformations lead to the same
stable distribution. Such a situation might arise in learning a simple
alternation between left and right. The learning might leave p(L) = p(R) = .5,
although the transitional probabilities were altered.
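
The alternation case can be illustrated numerically. The two matrices below are hypothetical: U describes indifferent responding and V a strong tendency to alternate. They commute, and both carry any initial distribution to (.5, .5).

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def stable_distribution(t, d0=(0.9, 0.1), steps=500):
    """Apply t repeatedly to an arbitrary starting distribution."""
    d = list(d0)
    for _ in range(steps):
        d = [t[0][0] * d[0] + t[0][1] * d[1],
             t[1][0] * d[0] + t[1][1] * d[1]]
    return d


# Hypothetical transformations over the alternatives (L, R).
U = [[0.5, 0.5], [0.5, 0.5]]   # before learning: no sequential dependency
V = [[0.2, 0.8], [0.8, 0.2]]   # after learning: strong alternation

UV = matmul(U, V)
VU = matmul(V, U)
d_u = stable_distribution(U)
d_v = stable_distribution(V)
```

The transitional probabilities change from .5 to .8, yet the distribution over left and right stays at (.5, .5) throughout.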
This discussion of learning should suggest some of the descriptive
possibilities of systems of dependent probabilities. By this general development
we arrived at a mathematical description of complex behavioral changes,
a description that enables us to talk about the gradual replacement of one
pattern of responses by another.
ON THE MAXIMUM LIKELIHOOD ESTIMATE OF THE
SHANNON-WIENER MEASURE OF INFORMATION


GEORGE A. MILLER
AND
WILLIAM G. MADOW


The limiting form and the first two asymptotic moments of the sampling
distribution of the maximum likelihood estimate of the Shannon-Wiener measure of amount
of information per observation drawn from a multinomial distribution are determined.
Also, approximations to the bias and the mean square error of the estimate are given.


Preface 


The statistic defined by Shannon (3) and by Wiener (4) to measure the amount of
information in an event drawn from a multinomial distribution has been adopted by
some psychologists to measure certain aspects of stimulus and response events in
psychological experiments (2). In these applications, however, the psychologist is
usually forced to work with relatively small samples and the sampling distribution of the
measure becomes of real interest. In the present paper the first two moments of the
asymptotic distribution are derived and the bias of the statistic for small samples is
explored.


1. The Limiting Distribution of the Maximum Likelihood Estimate 
of Amount of Information 


If an experiment or operation has k possible results, the ith of which has
probability $p_i > 0$, $i = 1, \ldots, k$, the Shannon-Wiener measure of the amount of
information per performance of this operation or event is

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i.$$

Since H is a function of $p_1, \ldots, p_k$ for all positive values of the
probabilities, it follows that the maximum likelihood estimate, H', is

$$H' = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n},$$

where $n_i$ is the number of occurrences of the ith result in n
performances, and where, if $n_i = 0$ for one or more values of i, we define the
corresponding terms $(n_i/n) \log_2 (n_i/n)$ of H' to be 0.

We will now show: (a) If the $p_i$ are not all equal, then H' has a normal limiting
distribution; and (b) if $p_i = 1/k$, $i = 1, \ldots, k$, then H' has a chi-square limiting
distribution with k − 1 degrees of freedom.

As a preliminary, we obtain H − H' in a form that simplifies the further
calculations.

LEMMA. The difference H − H' is given by the following equations:
Let

$$U_n = \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}$$

and

$$V_n = \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i. \tag{1}$$

Then

$$H - H' = U_n + V_n,$$

where $p_i > 0$, $i = 1, \ldots, k$. Terms in $U_n$ that have $n_i = 0$ are themselves defined to
vanish, but terms in $V_n$ that have $n_i = 0$ still yield $-p_i \log_2 p_i$.


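PROOF. By simple substitutions we can expand H' as follows:

$$H' = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}
= -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}
 - \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i
 - \sum_{i=1}^{k} p_i \log_2 p_i.$$

All we need to do is verify that the effects of $n_i = 0$ are as stated. Suppose, for example,
that $n_1 = 0$ but $n_i > 0$ otherwise. Then from the definitions of H and H' we have

$$H - H' = -\sum_{i=1}^{k} p_i \log_2 p_i + \sum_{i=2}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

and

$$U_n = \sum_{i=2}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}, \qquad
V_n = \sum_{i=2}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i - p_1 \log_2 p_1,$$

so that if we combine the values of $U_n$ and $V_n$, we verify that $H - H' = U_n + V_n$.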


THEOREM 1.
a. If the $p_i$ are not all equal, then $\sqrt{n}\,(H - H')$ has a normal limiting distribution
with mean 0 and variance

$$\sigma^2 = \sum_{i=1}^{k} p_i (\log_2 p_i + H)^2.$$

b. If $p_i = 1/k$, $i = 1, \ldots, k$, then $(2n/\log_2 e)(H - H')$ has a chi-square limiting
distribution with k − 1 degrees of freedom.
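
The variance in part (a) is simple to evaluate, and doing so shows why the equal-probability case needs separate treatment in part (b). The following is an illustrative sketch with hypothetical function names.

```python
import math


def entropy_bits(p):
    """H = -sum p_i log2 p_i for a distribution with positive probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def asymptotic_variance(p):
    """Variance of the limiting normal distribution in Theorem 1a."""
    h = entropy_bits(p)
    return sum(x * (math.log2(x) + h) ** 2 for x in p if x > 0)


unequal = asymptotic_variance([0.5, 0.25, 0.25])
uniform = asymptotic_variance([0.25, 0.25, 0.25, 0.25])
```

For probabilities (.5, .25, .25), H = 1.5 and the variance is .25. For equal probabilities every term $\log_2 p_i + H$ is zero, so the variance vanishes and the normal limit degenerates.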


The first part of Theorem 1 holds for maximum likelihood estimates almost
without exception (e.g. [1], p. 500). Also, maximum likelihood estimates are
asymptotically efficient. We will prove both parts of the theorem since most of the calculations
made would be needed in any case for the asymptotic moments. Because of the
preceding lemma, the problem of evaluating $\sqrt{n}\,(H - H')$ can be replaced by the
equivalent problem of evaluating $\sqrt{n}\,U_n + \sqrt{n}\,V_n$.


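PROOF. We first note that if the $p_i$ are not all equal then $\sqrt{n}\,V_n$ has a normal
limiting distribution with mean 0 and variance $\sigma^2$. We sketch the proof: The random
variables $\sqrt{n}\,(n_i/n - p_i)$, $i = 1, \ldots, k - 1$, have a (k − 1)-variate limiting normal
distribution with mean values 0, variances $p_i q_i$ ($q_i = 1 - p_i$), and covariances $-p_i p_j$,
$i, j = 1, \ldots, k - 1$ ($i \ne j$). Since the $\log p_i$ are constant weights applied to these
random variables, it is clear that $\sqrt{n}\,V_n$ is a linear combination of the random variables.
Therefore, $\sqrt{n}\,V_n$ has a limiting normal distribution with mean value

$$\sqrt{n}\,EV_n = \sqrt{n}\,E \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i
= \sqrt{n} \sum_{i=1}^{k} \log_2 p_i \, E\left(\frac{n_i}{n} - p_i\right) = 0,$$

and a variance

$$\begin{aligned}
\sigma^2 &= \sum_{i} (\log_2 p_i)^2 \,\mathrm{Var}\left[\sqrt{n}\left(\frac{n_i}{n} - p_i\right)\right]
 + \sum_{i \ne j} (\log_2 p_i)(\log_2 p_j)\,\mathrm{Cov}\left[\sqrt{n}\left(\frac{n_i}{n} - p_i\right), \sqrt{n}\left(\frac{n_j}{n} - p_j\right)\right] \\
&= \sum_{i} (\log_2 p_i)^2 p_i q_i - \sum_{i \ne j} (\log_2 p_i)(\log_2 p_j)\, p_i p_j \\
&= \sum_{i} p_i (\log_2 p_i)^2 - \sum_{i} \sum_{j} (p_i \log_2 p_i)(p_j \log_2 p_j) \\
&= \sum_{i} p_i (\log_2 p_i)^2 - H^2 \\
&= \sum_{i} p_i (\log_2 p_i + H)^2.
\end{aligned}$$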


We next show that $\sqrt{n}\,U_n$ converges in probability to zero as n increases, and
that $2nU_n/\log_2 e$ has a chi-square limiting distribution with k − 1 degrees of freedom.
Let us define

$$x_i = \frac{n_i - np_i}{np_i}.$$

Then

$$U_n = \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}
= \sum_{i=1}^{k} p_i (1 + x_i) \log_2 (1 + x_i), \tag{2}$$

and since $n_i \ge 1$, it follows that $x_i \ge x_{i0} = -1 + 1/np_i > -1$. Hence we can apply
Lemma A.2* and we obtain


| k ni — pi 3 =I fni= | £ 
logs e A 2 nl যী 2 0 -— নন mpi | + Rj, 03) 


where 
np? 


JOY 


ny — Pi | 0 2 x Pi I= mpi 
= A jG+D Opi)’ 


Lk 
Rie 2 
f=1 np; 
Furthermore, since 
L 
S (ni - np) = 0, 


Wwe have 
n J. CD" EE (n= np)’ 
ot 2 ae VC RE 4) 
where 
iH 


RE 3 ee lett h 
JU + Di pi) 

It follows from (2) that we do not need any special treatment of terms with 
n; = 0 in the approximations to U, yielded by (3), since the appearance of n; asa 
multiplier will automatically cause the corresponding term of (2) to vanish when 
n; = 0. The elimination of terms involving n; = 0 has made it possible for the re- 
mainder terms to be bounded, for if we did not require n; loge ni to vanish when n; = 0, 
it would follow that there would be positive probability that H’ would be indeter- 


minate. 
Furthermore, from Lemma B.1 and Lemma C.2 it can be seen that 


vs | ! 
Pr(Ripy > ©) oli) = of i=) 
and 


n | bs MEP 
ER (i=). where = 7 if jis odd 


if jis even. 


UI. 


Actually, it is easy to see from (4) that we have 
{ya 4 — np) (—1)y™ (nn, — npi)yi 2 
(-—1) (n; — npi) Db) Ip rR 


343° 


) 


R= GTA God  “GFDUTDIA ph 


* The letter “A” in “Lemma A.2" indicates that this lemma will be found in 
Appendix A. 




and hence, symbolically, 


Ry. a oz) i (5) E oz) “ ol | | 


Rs: 


Thus Eq. (5) shows that the upper bound of O(l/n’ 3) that we hav: 
unnecessarily large, but the above device is Sufficient 
converge to 0 as fast as 


e found for R;;; is 
য) 
to prove that R;,, and ER;;; 


3 (ni — np)! 
i (np; 
and f f 
বু IS 2? 
E (nn; np,) 
i= (pi) 
Now the first term of $2nU_n/\log_2 e$ is

$$\sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i},$$

which is well known to have a chi-square limiting distribution with k − 1 degrees of
freedom, whereas all other terms of $2nU_n/\log_2 e$ converge in probability to zero by
Lemma C.2. Hence $2nU_n/\log_2 e$ has a limiting chi-square distribution. On the other
hand, since

$$\sqrt{n}\,U_n = \left(\frac{2nU_n}{\log_2 e}\right) \left(\frac{\log_2 e}{2\sqrt{n}}\right)$$

is the product of a random variable that has a limiting distribution by a variable that
converges to 0, it follows that $\sqrt{n}\,U_n$ converges in probability to 0.

Thus, if the $p_i$ are not all equal, $\sqrt{n}\,(H - H') = \sqrt{n}\,V_n + \sqrt{n}\,U_n$ is the sum of
two random variables, one of which has a normal limiting distribution, whereas the
other converges in probability to zero. Hence $\sqrt{n}\,(H - H')$ has the same limiting
distribution as $\sqrt{n}\,V_n$.

On the other hand, if the $p_i$ are all equal, then $V_n = 0$ and $[2n/(\log_2 e)](H - H')$
has the same limiting distribution as $[2n/(\log_2 e)]U_n$, namely, chi-square with k − 1
degrees of freedom.


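2. The Limiting First Moment of H − H'

By (1)

$$EH' = H - EU_n - EV_n.$$

Since $EV_n = 0$, it follows that $-EU_n$ is the bias of H'. In order to evaluate this bias
we now approximate $EU_n$. From (4) we have

$$\frac{U_n}{\log_2 e} = \frac{1}{2n} \sum_i \frac{(n_i - np_i)^2}{np_i}
 - \frac{1}{6n} \sum_i \frac{(n_i - np_i)^3}{(np_i)^2}
 + \frac{1}{12n} \sum_i \frac{(n_i - np_i)^4}{(np_i)^3}
 - \frac{1}{20n} \sum_i \frac{(n_i - np_i)^5}{(np_i)^4}
 + \frac{1}{30n} \sum_i \frac{(n_i - np_i)^6}{(np_i)^5} + \cdots.$$

From Lemma B.1 we see that

$$\begin{aligned}
\frac{EU_n}{\log_2 e} &= \frac{1}{2n} \sum_i \frac{np_i q_i}{np_i}
 - \frac{1}{6n} \sum_i \frac{np_i q_i (q_i - p_i)}{(np_i)^2}
 + \frac{1}{12n} \sum_i \frac{3n^2 p_i^2 q_i^2 + np_i q_i (1 - 6p_i q_i)}{(np_i)^3} \\
&\quad - \frac{1}{20n} \sum_i \frac{10n^2 p_i^2 q_i^2 (q_i - p_i) + np_i q_i (q_i - p_i)(1 - 12p_i q_i)}{(np_i)^4} + O\left(\frac{1}{n^3}\right) \\
&= \frac{k - 1}{2n} - \frac{1}{6n^2} \sum_i \frac{q_i (q_i - p_i)}{p_i} + \frac{3}{12n^2} \sum_i \frac{q_i^2}{p_i} + O\left(\frac{1}{n^3}\right),
\end{aligned}$$

or, combining terms, we have

$$\begin{aligned}
\frac{EU_n}{\log_2 e} &= \frac{k - 1}{2n} + \frac{1}{12n^2} \sum_{i=1}^{k} \frac{1 - p_i^2}{p_i} + O\left(\frac{1}{n^3}\right) \\
&= \frac{k - 1}{2n} - \frac{1}{12n^2} + \frac{1}{12n^2} \sum_{i=1}^{k} \frac{1}{p_i} + O\left(\frac{1}{n^3}\right).
\end{aligned}$$

Hence, an estimate of H that is unbiased to terms of order 1/n is $H' + (\log_2 e)(k - 1)/2n$,
and an estimate of H that is unbiased to terms of order $1/n^2$ is

$$H' + (\log_2 e)\frac{k - 1}{2n} - \frac{\log_2 e}{12n^2} + \frac{\log_2 e}{12n^2} \sum_{i=1}^{k} \frac{1}{p_i}.$$

Thus, we have proved the following theorem:

THEOREM 2. Under the stated conditions

$$EH' = H - \log_2 e \left[\frac{k - 1}{2n} - \frac{1}{12n^2} + \frac{1}{12n^2} \sum_{i=1}^{k} \frac{1}{p_i}\right] + O\left(\frac{1}{n^3}\right).$$

Furthermore, if we let

$$H'' = H' + (\log_2 e)\frac{k - 1}{2n}$$

and let

$$H''' = H'' + \frac{\log_2 e}{12n^2}\left(\sum_{i=1}^{k} \frac{1}{p_i} - 1\right),$$

then

$$H = EH' + O(1/n),$$
$$H = EH'' + O(1/n^2),$$

and

$$H = EH''' + O(1/n^3).$$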


Theorem 2 enables us to make several observations about the bias: (1) the term
of order $n^{-1}$, namely, (k − 1)/2n, does not depend on the probabilities $p_i$ and hence
H'' has a bias of lower order than H' for all values of the $p_i$. (2) Since (k − 1)/2n and
$[(\sum 1/p_i) - 1]/12n^2$ are both positive quantities, H' is biased downward even to terms
of order $n^{-2}$ for all possible values of the $p_i$. (Terms of higher order may be negative,
of course, so that EH'' or EH''' may be greater than H for small values of n.) (3) An
increase in bias results if one uses $[(k - 1)/2n] - 1/12n^2$ as an overall correction and
omits $(\sum 1/p_i)/12n^2$. (4) When all the $p_i$ are equal, H''' becomes

$$H' + \log_2 e \left(\frac{k - 1}{2n} + \frac{k^2 - 1}{12n^2}\right)$$

(that is to say, if the $p_i$ are unequal, $\sum 1/p_i > k^2$).

TABLE 1
Expected Values of the Estimators H', H'', and H''' for the Binomial
Case when p_i = 0.50 and when p_i = 0.05 for Sample Sizes up to 20

Sample size        When p_i = 0.5             When p_i = 0.05
    n          EH'     EH''    EH'''       EH'     EH''    EH'''
    1          0       .721    1.082       0       .721    3.132
    2          .500    .861    .951        .095    .456    1.058
    3          .689    .929    .969        .131    .371    .640
    4          .781    .961    .983        .153    .333    .484
    5          .832    .977    .990        .169    .313    .410
    6          .865    .985    .995        .181    .301    .368
    7          .887    .990    .997        .191    .294    .343
    8          .903    .993    .998        .199    .289    .327
    9          .914    .994    .999        .206    .286    .316
   10          .924    .996    .999        .212    .284    .308
   11          .931    .997    1.000       .217    .283    .302
   12          .937    .997    1.000       .222    .282    .298
   13          .942    .998    1.000       .226    .281    .296
   14          .947    .998    1.000       .229    .281    .293
   15          .951    .999    1.000       .232    .280    .291
   16          .954    .999    1.000       .235    .280    .289
   17          .957    .999    1.000       .238    .280    .288
   18          .959    .999    1.000       .240    .280    .287
   19          .961    .999    1.000       .242    .280    .287
   20          .963    .999    1.000       .244    .280    .286
    ∞          1.000   1.000   1.000       .286    .286    .286

In order to illustrate the use of the bias corrections of Theorem 2 for a simple
case, we state the following:

COROLLARY. For the binomial case, k = 2, we obtain the following estimates of
H to terms of order $n^{-2}$:

If k = 2 and $p_i$ = 0.5, then

$$H = H' + (\log_2 e)\left(\frac{1}{2n} + \frac{1}{4n^2}\right).$$

If k = 2 and $p_i$ = 0.05, then

$$H = H' + (\log_2 e)\left(\frac{1}{2n} + \frac{381}{228n^2}\right).$$

In Table 1 the expected values of the estimates, EH', EH'', and EH''', are
compared with H for the binomial case for sizes of sample up to n = 20. When
$p_i$ = 0.5, samples as small as 5 give satisfactory estimates of H, but when $p_i$ = 0.05, the
size of sample needed becomes larger.
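
Expected values of this kind can be obtained for the binomial case by summing over every possible sample. The sketch below is an illustration with hypothetical function names; it applies the corrections of Theorem 2 with the true probabilities, as in the Corollary.

```python
import math

LOG2_E = math.log2(math.e)


def h_plugin(counts):
    """Maximum likelihood estimate H'; terms with a zero count are taken as 0."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


def expected_estimates(p, n):
    """Exact EH', EH'', EH''' for a binomial (k = 2) sample of size n."""
    eh1 = 0.0
    for x in range(n + 1):
        prob = math.comb(n, x) * p ** x * (1 - p) ** (n - x)
        eh1 += prob * h_plugin([x, n - x])
    eh2 = eh1 + LOG2_E * (2 - 1) / (2 * n)
    eh3 = eh2 + LOG2_E * (1 / p + 1 / (1 - p) - 1) / (12 * n * n)
    return eh1, eh2, eh3


even_n2 = expected_estimates(0.5, 2)
skew_n2 = expected_estimates(0.05, 2)
```

With n = 2 the script gives approximately (.500, .861, .951) for p = .5 and (.095, .456, 1.058) for p = .05.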


3. The Limiting Second Moment of H − H'

Since $E(H - H')^2 = EV_n^2 + 2EU_nV_n + EU_n^2$, we will now consider each of
these three terms in order.

a. Evaluation of $EV_n^2$

We have already seen that

$$EV_n^2 = \frac{1}{n} \sum_{i=1}^{k} p_i (\log_2 p_i + H)^2. \tag{6}$$


b. Evaluation of EU, Vy 
By (4), we have 


1+ logse ZL (-—1)" (ni — np)" ন 
UV, = ; 0 — tpi) oe [ ন 2 HDi Op + Ry), 


where Y 
m lS logs e 
Rj = E 2 (n — npi) og) Es Ry | 3 


=1 


By an analysis such as that summarized in (5) we can ignore ER; as involving 
terms that approach 0 more rapidly than the terms we shall retain. Hence 


BE EU, Vy 
ESE SE pO opp +X ECP - tp | 

= 20 = র্‌ (np)! BPE pi iil 

By Lemma B.3 
Elon — npn) nd = — TO — pd 

SO that 

a En; — mpm = npn) টু 2 pS E(n; — np)! 2 Ee 
N ll 2 (pi) EsPh 2h pI go OT 

ow 
— YS pu logs Pn = H + pilogs Pi 
ne 
SO that 
+ E(n; — np"! E@pilogs pIE(, — np) 


T 
Yq 


hod HX (np) i i=1 (np; 
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Hence ৰ ঠ ? 
HEU SE CAP Es E(t — Hp) Pilogpi  F | 
> log, ——+- 
log, e D2 0-1) > (np)! EE Les qi qi 
4 CI} EEG — np)! logsp, + H 
= A — 1) > (np)! | 
Ee i=] 1p, ) q 
S08: pit H চ (—1) E(n; — npiy™! 0 
a= qi seB Mt = D- Op ) 


We shall want to retain all terms of (7) of order $1/n$ or lower. From Lemma B.1 it is clear that we need consider only terms to $j = 4$. Hence, we begin by evaluating


1 1 np,qq; — pi) 1 3np? q + np, ql — 6Piqi Ys 1 lIOn*prqq,; — pi) 
Hc np; 6 np? ল্য 2 np? 


where we omit the second term of the fifth moment of $n_i$ since it will yield a term of order $1/n^2$. Then


|| Il, gil -— 6p qi) Sqidq, — pi) 
a Ll ii il; i 
al: qq, — Po) 211i 6np, ¥% 6np; 


idi qi 
= — PEt on (4p* — pq; + 4). 


By substituting in (7) we obtain 


REU,V, ) YE - 1 (4p: — H 
EEE COE Pi — qPi + 4) (logs pi + H) 
log, e 2 AP HUOEICEML Sc 6n ই pi 
cE 9. iss 
I (ogepi +H) + 7 DBL (8) 
কথ i D 
since

$$\sum_{i} p_i(\log_2 p_i + H) = 0.$$
Cc. Evaluation of EU¥ 
The first three terms of the approximation to U, , given by (4) will be used, namely, 
nU, lL Ln, — np) 1 


(n, — np,)® 1 £(n npi)' 
ঠি! ~(n; — npi) 
loge ™ 2,4, np; 62 (np )* bs 12 


2151 (pi) 
Since the details are very tedious, they have been put in Appendix D. Here we state the result.

THEOREM 3. Including terms of order 1/n we have


ARTE y Y 
Hl k LET 1 _ 9° — 20k +7 
OE € 


4 l12n 
and if k = 2, p, = bs we have 




Finally, from (6), (8), and Theorem 3, we obtain 


THEOREM 4. In general 


LE 3 logs et 
ER = HE 2 PEC OENRE PY © (log,p; + H) 
[= ন) 
2 log, e L.log,p, + H (logs AK: — 1) 
hE TE = = 
3n* i= Pi 4n* 
(logs eFTk — TLE. 1 (logs el ora 1 
£2 (942 _ 20k +7) + [5]. 
nm 


12n8 I=1Pt 1218 
but if all the p; are equal, then 
(logs el(Kk* —- 1) পট (logs ey 
4n> 1218 


(Tk — 1K 


E(H — H'}: 


(logs e)* 
1218 


(9K? — 20k +7) + ol =). 


Furthermore, if Kk = 2 and pi = 15, then 
Pi / 


= CoE + 5) j ol zh: 


4n* n ni 


E(H — H'Y 


In Theorem 4 we have approximated the mean square error of $H'$ about $H$. For any random variable $H'$ we have

$$E(H - H')^2 = \sigma_{H'}^2 + (EH' - H)^2,$$

where $\sigma_{H'}^2$ is the variance of $H'$, i.e.,

$$\sigma_{H'}^2 = E(H' - EH')^2.$$

Since $(EH' - H)$ is given by Theorem 2, we can approximate $\sigma_{H'}^2$ by using

$$(EH' - H)^2 = (\log_2 e)^2\left[\frac{(k-1)^2}{4n^2} + \frac{k-1}{12n^3}\left(\sum_{i=1}^{k}\frac{1}{p_i} - 1\right)\right] + o\left(\frac{1}{n^3}\right),$$

where the mean square error $E(H - H')^2$ will be obtained from Theorem 4. For estimating $H$, the mean square error is the more fundamental quantity.

By way of illustration, consider the binomial case where $k = 2$ and $p_i = 0.5$. Then we have the approximation

$$\sigma_{H'}^2 \approx \frac{3(\log_2 e)^2}{4n^2}\left(1 + \frac{1}{n}\right) - \frac{(\log_2 e)^2}{4n^2}\left(1 + \frac{1}{2n}\right)^2 \approx \frac{(\log_2 e)^2}{2n^2}\left(1 + \frac{1}{n}\right).$$


Appendix A. An Expansion for (1 + x) log (1 + x)

We begin with an expansion of $\log_e(1 + x)$ and then derive the expansion of $(1 + x)\log_e(1 + x)$.

LEMMA A.1. Let $-1 < x_0 < x$. Then

$$\log_e(1 + x) = x - \frac{x^2}{2} + \cdots + (-1)^{m-1}\frac{x^m}{m} + R_m, \tag{A.1}$$

where

$$R_m = (-1)^m\int_0^x \frac{t^m}{1+t}\,dt, \tag{A.2}$$

and hence, if $x_0 < 0$,

$$|R_m| \le \frac{|x|^{m+1}}{(m+1)(1+x_0)}, \tag{A.3}$$

while, if $x_0 \ge 0$,

$$|R_m| \le \frac{|x|^{m+1}}{m+1}. \tag{A.4}$$

PROOF. If $-1 < x$, then

$$\log_e(1 + x) = \int_0^x \frac{dt}{1+t},$$

and, if we expand $1/(1 + t)$, we obtain

$$\log_e(1 + x) = \sum_{i=1}^{m}(-1)^{i-1}\frac{x^i}{i} + (-1)^m\int_0^x \frac{t^m}{1+t}\,dt.$$

Thus (A.1) and (A.2) hold. Then (A.3) follows from the fact that $1/(1+t) \le 1/(1+x_0)$ if $x_0 < 0$, and (A.4) follows from the fact that $1/(1+t) \le 1$ if $x_0 \ge 0$.

LEMMA A.2. Let $-1 < x_0 < x$. Then

$$(1 + x)\log_e(1 + x) = x + \sum_{i=2}^{m+1}\frac{(-1)^i x^i}{i(i-1)} + R'_m, \tag{A.5}$$

where

$$R'_m = (-1)^m\int_0^x\!\int_0^u \frac{t^m}{1+t}\,dt\,du, \tag{A.6}$$

and hence, if $x_0 < 0$, then

$$|R'_m| \le \frac{|x|^{m+2}}{(1+x_0)(m+1)(m+2)}, \tag{A.7}$$

while, if $x_0 \ge 0$, then

$$|R'_m| \le \frac{|x|^{m+2}}{(m+1)(m+2)}. \tag{A.8}$$

PROOF. If $-1 < x$, then

$$\int_0^x \log_e(1 + t)\,dt = (1 + x)\log_e(1 + x) - x,$$

so that

$$(1 + x)\log_e(1 + x) = x + \int_0^x \log_e(1 + t)\,dt.$$

From Lemma A.1, it follows that if $x > -1$, then

$$\int_0^x \log_e(1 + t)\,dt = \sum_{i=1}^{m}\frac{(-1)^{i-1}x^{i+1}}{i(i+1)} + (-1)^m\int_0^x\!\int_0^u \frac{t^m}{1+t}\,dt\,du,$$

and hence (A.5) and (A.6) hold. If $x_0 < 0$ then, from (A.3) it follows that

$$|R'_m| \le \left|\int_0^x \frac{|u|^{m+1}}{(m+1)(1+x_0)}\,du\right| = \frac{|x|^{m+2}}{(1+x_0)(m+1)(m+2)},$$

so that (A.7) is proved. Then (A.8) follows in a similar fashion from (A.4).
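The expansion of Lemma A.2 and its remainder bounds can be checked numerically. The sketch below assumes the natural logarithm and the series form $x + \sum_{i=2}^{m+1} (-1)^i x^i / [i(i-1)]$; the function names are hypothetical and the test points are arbitrary.

```python
from math import log

def series(x, m):
    """Partial sum x + sum_{i=2}^{m+1} (-1)^i x^i / (i (i - 1))."""
    return x + sum((-1)**i * x**i / (i * (i - 1)) for i in range(2, m + 2))

def remainder_bound(x, x0, m):
    """Bound on the remainder: (A.7) when x0 < 0, (A.8) when x0 >= 0."""
    bound = abs(x)**(m + 2) / ((m + 1) * (m + 2))
    return bound / (1 + x0) if x0 < 0 else bound

if __name__ == "__main__":
    for x, x0 in ((0.3, 0.0), (-0.3, -0.5)):
        exact = (1 + x) * log(1 + x)
        error = abs(exact - series(x, 3))
        print(x, error, remainder_bound(x, x0, 3))
```

In both cases the observed error stays below the stated bound, which is all the later sections require of the lemma.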


Appendix B. Multinomial Moments

Let an operation having $k$ possible outcomes be independently performed $n$ times and let $n_i$ be the number of occurrences of the $i$th of the possible outcomes in the $n$ performances. Then,

$$\frac{n!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}$$

is the probability of obtaining any specified values of $n_1, \ldots, n_k$, where $n_1 + \cdots + n_k = n$, $p_i \ge 0$, $p_1 + \cdots + p_k = 1$, and $p_i$ is the probability of the occurrence of the $i$th of the possible outcomes in each operation, $i = 1, \ldots, k$.

Then, it is possible, by easy but tedious calculations, to prove the following lemma.

LEMMA B.1. The first six moments of $n_i$ are given by the following equations.

$$En_i = np_i$$
$$E(n_i - np_i)^2 = np_iq_i \quad (\text{where } q_i = 1 - p_i)$$
$$E(n_i - np_i)^3 = np_iq_i(q_i - p_i)$$
$$E(n_i - np_i)^4 = 3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i)$$
$$E(n_i - np_i)^5 = 10n^2p_i^2q_i^2(q_i - p_i) + np_iq_i(q_i - p_i)(1 - 12p_iq_i)$$
$$E(n_i - np_i)^6 = 15n^3p_i^3q_i^3 + 5n^2p_i^2q_i^2[5 - 26p_iq_i] + np_iq_i[1 - 30p_iq_i + 120p_i^2q_i^2].$$

In general, if $m$ is an integer, then

$$E(n_i - np_i)^{2m} = O(n^m)$$

and

$$E(n_i - np_i)^{2m+1} = O(n^m).$$

The proof of Lemma B.1 is omitted.
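Since the proof is omitted, a direct numerical check of Lemma B.1 may be useful. The sketch below computes the central moments of a single binomial count exactly from its probability function and compares them with the closed forms; the function names and the values n = 12, p = 0.3 are illustrative assumptions.

```python
from math import comb

def central_moment(n, p, r):
    """Exact r-th central moment of a binomial(n, p) count."""
    q = 1 - p
    return sum(comb(n, j) * p**j * q**(n - j) * (j - n * p)**r
               for j in range(n + 1))

def lemma_b1(n, p):
    """Closed forms of Lemma B.1 for the second through sixth moments."""
    q = 1 - p
    pq = p * q
    return {
        2: n * pq,
        3: n * pq * (q - p),
        4: 3 * n**2 * pq**2 + n * pq * (1 - 6 * pq),
        5: 10 * n**2 * pq**2 * (q - p) + n * pq * (q - p) * (1 - 12 * pq),
        6: (15 * n**3 * pq**3 + 5 * n**2 * pq**2 * (5 - 26 * pq)
            + n * pq * (1 - 30 * pq + 120 * pq**2)),
    }

if __name__ == "__main__":
    for r, value in lemma_b1(12, 0.3).items():
        print(r, value, central_moment(12, 0.3, r))
```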

We shall need not only the moments of n; about its mean but also certain of the 
product moments 
E(n; — pin; — np;)®. 

The following lemma will be helpful in deriving these moments. Its usefulness results 
from the fact that the needed conditional moments will be obtainable easily from 
Lemma B.1. 


LEMMA B.2. Let $x'$ be a random variable and let $A'$ be a random event. Then

$$E(x' - Ex')^{\nu} = \sum_{\alpha=0}^{\nu}\frac{\nu!}{\alpha!(\nu-\alpha)!}\,E\{[E(x' \mid A') - Ex']^{\nu-\alpha}\,E([x' - E(x' \mid A')]^{\alpha} \mid A')\}. \tag{B.1}$$

PROOF. In general, if $u$ is any random variable, then

$$Eu = E\{E(u \mid A')\}, \tag{B.2}$$

where $A'$ is a random event and the "$\mid$" denotes conditional expectation. To apply the general formula we put $u = (x' - Ex')^{\nu}$ and note that

$$(x' - Ex')^{\nu} = \sum_{\alpha=0}^{\nu}\frac{\nu!}{\alpha!(\nu-\alpha)!}\,[E(x' \mid A') - Ex']^{\nu-\alpha}[x' - E(x' \mid A')]^{\alpha}.$$

Then, since

$$E\{[E(x' \mid A') - Ex']^{\nu-\alpha} \mid A'\} = [E(x' \mid A') - Ex']^{\nu-\alpha},$$

the Lemma follows by substituting for $(x' - Ex')^{\nu}$ in (B.2).

Since we will apply (B.1) for $\nu = 1, 2, 3$ in the following Lemma, we now write out $E[(x' - Ex')^{\nu} \mid A']$ for $\nu = 1, 2, 3$:

$$E[(x' - Ex') \mid A'] = E(x' \mid A') - Ex' \tag{B.3}$$

$$E[(x' - Ex')^2 \mid A'] = [E(x' \mid A') - Ex']^2 + E\{[x' - E(x' \mid A')]^2 \mid A'\} \tag{B.4}$$

$$E[(x' - Ex')^3 \mid A'] = [E(x' \mid A') - Ex']^3 + 3[E(x' \mid A') - Ex']\,E\{[x' - E(x' \mid A')]^2 \mid A'\} + E\{[x' - E(x' \mid A')]^3 \mid A'\}. \tag{B.5}$$


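Let us now evaluate some joint moments for a multinomial distribution. In all cases, the random event $A'$ will be "$n_i$ has a specified value."

LEMMA B.3. We assume a multinomial population and suppose $i \ne j$. Then

$$E(n_j \mid n_i) = np_j - \frac{p_j}{q_i}(n_i - np_i)$$

$$E[(n_j - np_j) \mid n_i] = -\frac{p_j}{q_i}(n_i - np_i)$$

$$E[(n_j - np_j)^2 \mid n_i] = \frac{p_j^2}{q_i^2}(n_i - np_i)^2 - \frac{p_j(q_i - p_j)}{q_i^2}(n_i - np_i) + \frac{np_j(q_i - p_j)}{q_i}$$

$$E[(n_j - np_j)^3 \mid n_i] = -\frac{p_j^3}{q_i^3}(n_i - np_i)^3 + \frac{3p_j^2(q_i - p_j)}{q_i^3}(n_i - np_i)^2 - \left[\frac{3np_j^2(q_i - p_j)}{q_i^2} + \frac{p_j(q_i - p_j)(q_i - 2p_j)}{q_i^3}\right](n_i - np_i) + \frac{np_j(q_i - p_j)(q_i - 2p_j)}{q_i^2}.$$

PROOF. If $n_i$ is fixed the conditional size of sample is $n - n_i$ and the conditional probability of the $j$th possible outcome is $p_j/q_i$. Hence

$$E(n_j \mid n_i) = (n - n_i)\frac{p_j}{q_i} = np_j - \frac{p_j}{q_i}(n_i - np_i),$$

$$E[(n_j - np_j) \mid n_i] = (n - n_i)\frac{p_j}{q_i} - np_j = -\frac{p_j}{q_i}(n_i - np_i).$$

Also

$$E(n_j \mid n_i) - En_j = -\frac{p_j}{q_i}(n_i - np_i).$$

Hence, by (B.4),

$$E[(n_j - np_j)^2 \mid n_i] = \frac{p_j^2}{q_i^2}(n_i - np_i)^2 + (n - n_i)\frac{p_j}{q_i}\left(1 - \frac{p_j}{q_i}\right) = \frac{p_j^2}{q_i^2}(n_i - np_i)^2 - \frac{p_j(q_i - p_j)}{q_i^2}(n_i - np_i) + \frac{np_j(q_i - p_j)}{q_i},$$

since $n - n_i = nq_i - (n_i - np_i)$. Finally, by (B.5),

$$E[(n_j - np_j)^3 \mid n_i] = -\frac{p_j^3}{q_i^3}(n_i - np_i)^3 - 3\frac{p_j}{q_i}(n_i - np_i)(n - n_i)\frac{p_j}{q_i}\left(1 - \frac{p_j}{q_i}\right) + (n - n_i)\frac{p_j}{q_i}\left(1 - \frac{p_j}{q_i}\right)\left(1 - \frac{2p_j}{q_i}\right),$$

which reduces to the expression stated in the Lemma when $n - n_i$ is again replaced by $nq_i - (n_i - np_i)$.

Hence

$$E(n_i - np_i)(n_j - np_j) = E\left\{(n_i - np_i)\left[-\frac{p_j}{q_i}(n_i - np_i)\right]\right\} = -\frac{p_j}{q_i}\,np_iq_i = -np_ip_j,$$

and

$$E(n_i - np_i)^2(n_j - np_j)^2 = \frac{p_j^2}{q_i^2}[3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i)] - \frac{p_j(q_i - p_j)}{q_i^2}\,np_iq_i(q_i - p_i) + \frac{np_j(q_i - p_j)}{q_i}\,np_iq_i$$

$$= 3n^2p_i^2p_j^2 + \frac{np_ip_j^2(1 - 6p_iq_i)}{q_i} - \frac{np_ip_j(q_i - p_j)(q_i - p_i)}{q_i} + n^2p_ip_j(q_i - p_j),$$

so that

$$\sum_{j \ne i}\frac{E(n_i - np_i)^2(n_j - np_j)^2}{n^2p_ip_j} = \sum_{j \ne i}\left[3p_ip_j + (q_i - p_j) + \frac{p_j(1 - 6p_iq_i)}{nq_i} - \frac{(q_i - p_j)(q_i - p_i)}{nq_i}\right].$$

Now

$$\sum_{j \ne i} p_j = 1 - p_i = q_i, \qquad \sum_{j \ne i}(q_i - p_j) = (k - 1)q_i - q_i = (k - 2)q_i.$$

Hence

$$\sum_{i}\sum_{j \ne i}\frac{E(n_i - np_i)^2(n_j - np_j)^2}{n^2p_ip_j} = \sum_{i}\left[3p_iq_i + (k - 2)q_i + \frac{1 - 6p_iq_i}{n} - \frac{(k - 2)(q_i - p_i)}{n}\right]$$

$$= 3\sum_i p_iq_i + (k - 1)(k - 2) + \frac{1}{n}\left[k - 6\sum_i p_iq_i - (k - 2)^2\right].$$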
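The product moments obtained from Lemma B.3 can be checked by brute-force enumeration. The sketch below does this for a trinomial population; the sample size, the probabilities, and the function names are illustrative assumptions rather than anything in the text.

```python
from math import factorial

def trinomial_expectation(n, probs, func):
    """Exact expectation of func(n1, n2, n3) under a trinomial law."""
    p1, p2, p3 = probs
    total = 0.0
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            n3 = n - n1 - n2
            coef = factorial(n) // (factorial(n1) * factorial(n2) * factorial(n3))
            total += coef * p1**n1 * p2**n2 * p3**n3 * func(n1, n2, n3)
    return total

def product_moment_22(n, pi, pj):
    """Closed form for E(n_i - n p_i)^2 (n_j - n p_j)^2 with i != j."""
    qi = 1 - pi
    return (3 * n**2 * pi**2 * pj**2
            + n * pi * pj**2 * (1 - 6 * pi * qi) / qi
            - n * pi * pj * (qi - pj) * (qi - pi) / qi
            + n**2 * pi * pj * (qi - pj))

if __name__ == "__main__":
    n, probs = 6, (0.5, 0.3, 0.2)
    p1, p2, _ = probs
    cov = trinomial_expectation(n, probs,
                                lambda a, b, c: (a - n * p1) * (b - n * p2))
    m22 = trinomial_expectation(n, probs,
                                lambda a, b, c: (a - n * p1)**2 * (b - n * p2)**2)
    print(cov, -n * p1 * p2)
    print(m22, product_moment_22(n, p1, p2))
```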
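Appendix C. Order of Convergence and Convergence in Probability

If

$$\lim_{n \to \infty} n^{\alpha}f(n)$$

is bounded, we say that $f(n)$ is at most of order $1/n^{\alpha}$ and write

$$f(n) = O\left(\frac{1}{n^{\alpha}}\right).$$

If

$$\lim_{n \to \infty} n^{\alpha}f(n) = 0,$$

we say that $f(n)$ is of lower order than $1/n^{\alpha}$ and write

$$f(n) = o\left(\frac{1}{n^{\alpha}}\right).$$

A sequence of random variables $u_1, u_2, \ldots$ converges in probability to 0 if, for every $\epsilon > 0$, we have

$$\lim_{n \to \infty} \Pr(|u_n| > \epsilon) = 0.$$

LEMMA C.1. If $2\alpha > \beta$, then $\dfrac{|n_i - np_i|^{\beta}}{(np_i)^{\alpha}}$ converges in probability to 0 as $n$ becomes infinite.

PROOF. Using some simple manipulations and the Tchebycheff inequality, we have

$$\Pr\left(\frac{|n_i - np_i|^{\beta}}{(np_i)^{\alpha}} > \epsilon\right) = \Pr\left(|n_i - np_i| > \epsilon^{1/\beta}(np_i)^{\alpha/\beta}\right) \le \frac{np_iq_i}{\epsilon^{2/\beta}(np_i)^{2\alpha/\beta}} = \frac{q_i}{\epsilon^{2/\beta}(np_i)^{(2\alpha/\beta)-1}},$$

so that convergence in probability occurs if

$$\frac{2\alpha}{\beta} - 1 > 0,$$

i.e., if $2\alpha > \beta$.

LEMMA C.2. If $2\alpha > \beta$, then

$$\sum_{i=1}^{k}\frac{|n_i - np_i|^{\beta}}{(np_i)^{\alpha}}$$

converges in probability to 0 as $n$ becomes infinite.

PROOF. Since

$$\Pr\left(\sum_{i=1}^{k}\frac{|n_i - np_i|^{\beta}}{(np_i)^{\alpha}} > \epsilon\right) \le \Pr\left(\text{at least one of } \frac{|n_i - np_i|^{\beta}}{(np_i)^{\alpha}} > \frac{\epsilon}{k}\right) \le \sum_{i=1}^{k}\Pr\left(\frac{|n_i - np_i|^{\beta}}{(np_i)^{\alpha}} > \frac{\epsilon}{k}\right) \le \sum_{i=1}^{k}\frac{k^{2/\beta}q_i}{\epsilon^{2/\beta}(np_i)^{(2\alpha/\beta)-1}},$$

the result follows if $2\alpha > \beta$.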
Appendix D. Evaluation of EUS 


The first three terms of the approximation to U, given by (4) will be the basis for 
the approximation we use, i.e., 


Et (ni — np! 


nU, 1 En, —-np) (n — ph 
ee ন > 


logs e =D HH 62 Opi) 21 Opi? 
Let us define

$$w_{ij} = \frac{(n_i - np_i)^j}{(np_i)^{j-1}}.$$

Then, excluding terms that will i moments E(n; — npi)’ where j 2 7, we have 


(i ) =5(> চY wis + > Whew i ) 


logs @ izh 
ie ( £ wows + 2 waa) 
6\i=1 izh 
IVE 
+5 চক 52 Wig t+ 2 WngWi3 
($ WigWia t+ B mena) . 
2 \i=1 izh 


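We now evaluate the necessary expected values. Inasmuch as we wish to retain only terms of $O(1/n)$ or lower we shall drop the terms of higher order as they appear, indicating their omission by $\sim$.

$$Ew_{i2}^2 = E\frac{(n_i - np_i)^4}{(np_i)^2} = \frac{3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i)}{n^2p_i^2} = 3q_i^2 + \frac{q_i(1 - 6p_iq_i)}{np_i}$$

$$Ew_{i2}w_{i3} = E\frac{(n_i - np_i)^5}{(np_i)^3} = \frac{10n^2p_i^2q_i^2(q_i - p_i) + np_iq_i(q_i - p_i)(1 - 12p_iq_i)}{n^3p_i^3} = \frac{10q_i^2(q_i - p_i)}{np_i} + \frac{q_i(q_i - p_i)(1 - 12p_iq_i)}{n^2p_i^2}$$

$$Ew_{i3}^2 = Ew_{i2}w_{i4} = E\frac{(n_i - np_i)^6}{(np_i)^4} = \frac{15n^3p_i^3q_i^3 + 5n^2p_i^2q_i^2[5 - 26p_iq_i] + np_iq_i[1 - 30p_iq_i + 120p_i^2q_i^2]}{n^4p_i^4}$$

$$= \frac{15q_i^3}{np_i} + \frac{5q_i^2(5 - 26p_iq_i)}{n^2p_i^2} + \frac{q_i(1 - 30p_iq_i + 120p_i^2q_i^2)}{n^3p_i^3}$$

so that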

k 
| Sv + 5S wie | 

t=1 i#1 


Ls Ly 
=3> 9} +3K-D-3y G+ -MNk 1 
i=l i=1 


k u 
[> 629 +k 6k +62 gt - 0-2 
NLi=1q;i i=1 i=! 


NEE 
Me 0 Oia 
i=1 Pi 


Similarly, for the second term, 


k 
El X wows + S WhaWig 
i izh 


=; Es Pi) , l0qq,—p) 3k —-Dgi Kk — Digi =] 
Hp; n n রি npi 


i=1 


= 109g, — pi) — (Kk — DNgi(dp; — qn) 
=1 np; 


i 


(k + 8)q? — (4k + 2)pqi LE gi 
= np; = 2 jp, (te + 8) — Spike + 2]. 


For the third term, 


[3 k 
| > wi +S wa | = (57: = ar) 
i=1 i=1 


For the final term, 


Ly Ls 2 ls 9 
al 2 Wis Wig + D wis * | = b> ov, 15g; ক 3k — | 
i= iZh i= 


np; n np; 


= i 5, +5p+3%k MY 3 Gy 
i 1 Hpi 
Hence,
AU Re=1I If (k+8) Lg 
E ) ~ — 2 — 2K +21- - 
( e 4 ৰ 4n [> Pi K i | 6n EL 
k+2 15 Ed I 33k -—-D Eq 
hh 6n LEY 36n 1 pi “ 36n ED I2n j=l pi 
K-11 Sl Eh k+8 Sl 
4 রথ 4n =P: 4n 6n i71DPi 
k+8 S(k + DEK — 1) 5 
i 6n li 6n ™ I2n ie. 
5 Eq 3Gk od 
কল ইঃ 
I2n;=1 Pi 12n pi 
Finally, 


nU, > &k-1 1/2 ') > 
2. |= —— +l > =| = 2% =16 + 9% +2 
e( ) 4 i nS Pi G ) 


+176 — 6k — 3k2 + 2k? + 16k + 10K? 
2n 


+ 10k — 20 — 5k +5 +2 + Sk — 18k] 


nV, 
logs @ 


As a check, e( ) was computed for a special case. Let k = 2 so that 


ny +h =n, ng =n hi, 


ns — tps =n — hn — tps = —(n, — npi). 


Hence, 
nU, 1(n — np l | 1m —- pf 1 
ড় ( Ey ঠ A ত E 
logge — 2 n pin Pr fi 
1 = | 1 j ) 
নী ns Pi in G ঠ 
and


1 (On, — np)? 
id = (c 2) 
4 pig ‘36 mipiq: dhl 6 mpg fn 


Us Yl — np! Ln — pit 
(es) 


1 (n, — npi)s 
12 npg! 


nU, l 
2 fs 5) =~ চট rE (3n2 pi 1g) + npiqill — 6piqi]) 


} lSnpiqi(a "2 oP; A 


qi +p. 


36n'piq; 6onpigs 
4 Br Pg 0 ) 
I2irp; 1! 
3.1 -—6pq Sq, - pp Hq pi) 5G +p) 
=-+ + EE ES as $f nL 
El npg, 12np,q; 3npiq, 4npig; 
3 || 5 Y ] 9)]. 
= শ্ব Tang _ I8pyq, + Hq, — pi: — 20g, — pi): + 15g, + pl 


Let p; = qi = 112, so that 


Eo ) a 1 (3 9 15 
log, e 4 ho) $F 


4 
En SSL EEE 3 %, 
ত ব ET. 
IfKk = 2 and p, =, = I12, 
el nU, j RD) j 6 n 
ogee 3 AL 5} (from square of Ist term) 
+0 (from square of 2nd term) 
+0 (from product of Ist by 2nd) 
= 3 
টা (from product of Ist by 3rd). 
From general formula 
3 
Ist term squared ~ lO CT TEED 
4 4 4 2n 
2nd term squared ন 2 $5) = ই = সু টি bd = 
36 hr 1/2 36n  36n  36n 
10 
product of Ist by 2nd —-—2 SE =0 
চ 6n 6n 
3.3% 5.1 || 
product of Ist by 3rd term ত (5 +53) = Bd Eo 


Hence each part checks. 
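The special-case check can also be carried out by machine. The sketch below assumes $k = 2$ and $p_1 = p_2 = 1/2$, in which case the cubic term of the three-term approximation cancels and, writing $d = n_1 - n/2$, the approximation to $nU/\log_2 e$ reduces to $2d^2/n + \tfrac{4}{3}d^4/n^3$ up to sign. The square of this quantity is averaged exactly over the binomial distribution and compared with $3/4 + 3/(4n)$, the sum of the contributions $3/4 - 1/(2n)$ and $5/(4n)$ that follow from Lemma B.1. The function name is hypothetical.

```python
from math import comb

def exact_second_moment(n):
    """E of the squared three-term approximation for k = 2, p = 1/2."""
    total = 0.0
    for n1 in range(n + 1):
        d = n1 - n / 2                       # deviation of n_1 from its mean
        value = 2 * d**2 / n + (4 / 3) * d**4 / n**3
        total += comb(n, n1) / 2**n * value**2
    return total

if __name__ == "__main__":
    for n in (50, 200, 800):
        print(n, exact_second_moment(n), 0.75 + 0.75 / n)
```

The two columns differ only by terms of order $1/n^2$, which is the order that the expansion discards.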




REFERENCES 


[1] Cramér, H. Mathematical methods of statistics. Princeton: Princeton University Press, 1946.

[2] Miller, G. A. What is information measurement? Amer. Psychologist, 1953, 8, 3-11.

[3] Shannon, C. E. A mathematical theory of communication. Bell System Tech. J., 1948, 
27, 379-423. 

[4] Wiener, N. Cybernetics. New York: Wiley, 1948. 


A STATISTICAL DESCRIPTION OF VERBAL LEARNING* 


GEORGE A. MILLER AND WILLIAM J. MCGILL


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 


Free-recall verbal learning is analyzed in terms of a probability model. 
The general theory assumes that the probability of recalling a word on any 
trial is completely determined by the number of times the word has been 
recalled on previous trials. Three particular 
examined. In these three cases i 


application of these special cases to typi 
An interpretation of the model in terms of set theory is suggested but is not essential to the argument.


The verbal learning considered in this paper is the kind observed in the following experiment: A list of words is presented to the learner. At the end of the presentation he writes down all the words he can remember. This procedure is repeated through a series of n trials. At the present time we are not prepared to extend the statistical theory to a wider range of experimental procedures.


The General Model 


y the number of times the word has 
In other words, the probability that 2 
is a function of k, the number of times 


(Symbols and their meanings are listed in 
per.) 


rd is in state $A_k$. Thus before the first trial all the words are in state $A_0$; that is to say, they have been recalled zero times on previous trials. Ideally, on the first trial a proportion $\tau_0$ of these words is recalled and so passes from state $A_0$ to state $A_1$. The proportion $1 - \tau_0$ is not recalled and so remains in state $A_0$. On the second trial the
*This research was facilitated by the authors’ membership in the Inter-University 
or Behavior Theory Held a rafter escsease Conc onidta eon Modal 


; e 28-August 24 + The authors are 
especially grateful to Dr. F. Mosteller for advice and Ur tiCian! tha ove helpful on 
many different occasions. 


This article appeared in Psychometrika, 1952, 17, 369-396. Reprinted with permission. 
words that remained in $A_0$ undergo the same transformation as before. Of those in $A_1$, however, the proportion $1 - \tau_1$ is not recalled and so remains in $A_1$.

One general problem is to determine the proportion of words expected in state $A_k$ on trial $n$. Let $p(A_k, n)$ represent the probability that a word is in state $A_k$ on trial $n$. Since these are probabilities, they must sum to unity on any given trial:

$$\sum_{k=0}^{n} p(A_k, n) = 1.$$

The number of trials and the total number of times a word has been recalled must assume non-negative, integral values. We assume that a word can be recalled only once per trial at most, so the number of recalls cannot exceed the number of trials. Therefore, we have

$$p(A_k, n) = 0 \quad \text{for } k < 0,\ n < 0,\ n < k.$$

We also assume that none of the words can have been recalled before the first trial, so for $n = 0$,

$$p(A_k, 0) = 1 \ \text{for } k = 0, \qquad p(A_k, 0) = 0 \ \text{for } k > 0.$$

For all trials we have the difference equation:

$$p(A_k, n + 1) = p(A_k, n)(1 - \tau_k) + p(A_{k-1}, n)\tau_{k-1}. \tag{1}$$

This equation reflects the fact that a word can get into state $A_k$ on trial $n + 1$ in only two ways: (a) either it is in $A_k$ on trial $n$ and is not recalled on trial $n + 1$, or (b) it is in $A_{k-1}$ on trial $n$ and is recalled on trial $n + 1$.
The following rationalization for this scheme is in the spirit of the statistical theories of learning developed by Bush and Mosteller (1) and by Estes (3). The rationalization is not necessary for the development of the mathematics, but it gives an alternative way of thinking about the present model and helps to clarify its relation to the earlier theories. On the first presentation of the list of words a random sample of stimulus elements is conditioned to the appropriate response for each word. The measure of this set of conditioned elements is $\tau_0$. (The total measure of the set of all stimulus elements for a given word is assumed to be unity, so the measure can be regarded as a probability.) If a word is not recalled, the measure of conditioned elements for that word is unchanged. If a word is recalled, the proportion of conditioned elements is increased. The effect of recalling a word is to take another random sample of elements from the set and condition them. The proportion of elements conditioned when a word in state $A_k$ is recalled is $\tau_{k+1} - \tau_k$. A more precise statement of this theoretical argument will be presented when we consider the special cases of the general theory.
The general solution of (1) when all the $\tau_i$ are different is (see Appendix A):

$$p(A_0, n) = (1 - \tau_0)^n, \quad \text{for } k = 0,$$

$$p(A_k, n) = \tau_0\tau_1\cdots\tau_{k-1}\sum_{i=0}^{k}\frac{(1 - \tau_i)^n}{\prod_{j=0,\ j \ne i}^{k}(\tau_j - \tau_i)}, \quad \text{for } k > 0. \tag{2}$$

The denominator of each of the fractions in the summation includes all differences of the form $(\tau_j - \tau_i)$ except for the zero difference $(\tau_i - \tau_i)$.
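The general solution (2) can be compared with direct iteration of (1). The sketch below uses illustrative, mutually distinct transitional probabilities; the function names are hypothetical.

```python
from math import prod

def closed_form(tau, k, n):
    """p(A_k, n) from the general solution (2); all tau values must differ."""
    if k == 0:
        return (1 - tau[0])**n
    lead = prod(tau[:k])                        # tau_0 tau_1 ... tau_{k-1}
    total = 0.0
    for i in range(k + 1):
        denom = prod(tau[j] - tau[i] for j in range(k + 1) if j != i)
        total += (1 - tau[i])**n / denom
    return lead * total

def iterate(tau, n):
    """p(A_k, n) for k = 0..len(tau)-1 by repeated use of equation (1)."""
    p = [1.0] + [0.0] * (len(tau) - 1)
    for _ in range(n):
        p = [p[k] * (1 - tau[k]) + (p[k - 1] * tau[k - 1] if k > 0 else 0.0)
             for k in range(len(tau))]
    return p

if __name__ == "__main__":
    tau = [0.2, 0.35, 0.5, 0.65, 0.8, 0.9]      # illustrative values only
    n = 5
    for k, value in enumerate(iterate(tau, n)):
        print(k, round(value, 6), round(closed_form(tau, k, n), 6))
```

The closed form also returns zero (to rounding error) when $k > n$, in agreement with the boundary conditions.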


The expected number of times a word is recalled, all told, up to and including trial $n$, is, by definition,

$$E(k, n) = \sum_{k=0}^{n} k\,p(A_k, n). \tag{3}$$

The expected proportion of words recalled on trial $n + 1$ is the difference, $E(k, n + 1) - E(k, n)$, between the cumulative values on successive trials. This difference is the theoretical recall score and we symbolize it by $\rho_{n+1}$. Thus we have the general relation

$$\rho_n = 0, \quad \text{for } n = 0,$$
$$\rho_{n+1} = E(k, n + 1) - E(k, n), \quad \text{for } n + 1 > 0. \tag{4}$$

An alternative expression is

$$\rho_{n+1} = \sum_{k=0}^{n} \tau_k\,p(A_k, n). \tag{5}$$


The two expressions (4) and (5) are equivalent, which can be shown as follows. From (3) and (4) together we have

$$\rho_{n+1} = \sum_{k=0}^{n+1} k\,p(A_k, n + 1) - \sum_{k=0}^{n} k\,p(A_k, n).$$

The first summation on the right can be rewritten by substituting for $p(A_k, n + 1)$ according to (1):

$$\sum_{k=0}^{n+1} k\,p(A_k, n + 1) = \sum_{k=0}^{n+1} k\,p(A_k, n)(1 - \tau_k) + \sum_{k=0}^{n+1} k\,p(A_{k-1}, n)\tau_{k-1}$$

$$= \sum_{k=0}^{n} k\,p(A_k, n) - \sum_{k=0}^{n} k\,p(A_k, n)\tau_k + \sum_{k=0}^{n}(k + 1)\,p(A_k, n)\tau_k.$$

When this result is substituted into the expression for $\rho_{n+1}$, we have

$$\rho_{n+1} = -\sum_{k=0}^{n} k\,p(A_k, n)\tau_k + \sum_{k=0}^{n}(k + 1)\,p(A_k, n)\tau_k = \sum_{k=0}^{n}\tau_k\,p(A_k, n),$$

which is the desired result.
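The equivalence of (4) and (5) can be illustrated numerically. In the sketch below the recall score is computed once as the increase in the expected cumulative number of recalls and once as the weighted sum of transitional probabilities; the `tau` values and function names are illustrative assumptions.

```python
def step(p, tau):
    """One application of the difference equation (1)."""
    return [p[k] * (1 - tau[k]) + (p[k - 1] * tau[k - 1] if k > 0 else 0.0)
            for k in range(len(p))]

def recall_scores(tau, n_trials):
    """Recall score for trials 1..n_trials, by (4) and by (5).

    tau must supply at least n_trials + 1 values.
    """
    p = [1.0] + [0.0] * (len(tau) - 1)
    by_difference, by_sum = [], []
    for _ in range(n_trials):
        nxt = step(p, tau)
        mean_now = sum(k * pk for k, pk in enumerate(p))
        mean_next = sum(k * pk for k, pk in enumerate(nxt))
        by_difference.append(mean_next - mean_now)            # equation (4)
        by_sum.append(sum(t * pk for t, pk in zip(tau, p)))   # equation (5)
        p = nxt
    return by_difference, by_sum

if __name__ == "__main__":
    tau = [0.2, 0.35, 0.5, 0.65, 0.8, 0.9]      # illustrative values only
    diff, direct = recall_scores(tau, 5)
    for a, b in zip(diff, direct):
        print(round(a, 6), round(b, 6))
```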

The asymptotic behavior of the model as $n$ increases without limit can be deduced from the general solution (2). First consider the case in which one or more of the transitional probabilities $\tau_k$ is zero. All the words start in state $A_0$ and have a positive probability of moving along to states $A_1$, $A_2$, etc., up to the first state, $A_h$, with zero transitional probability, $\tau_h = 0$. There the words are trapped; eventually all the words are recalled exactly $h$ times and cannot be recalled again. This fact can be seen from (2): If $\tau_i > 0$, then all the terms $(1 - \tau_i)^n$ in (2) go to zero as $n \to \infty$. Thus $p(A_k, n)$ goes to zero for $k < h$. For $k > h$, the product in front of the summation must include $\tau_h = 0$, and so $p(A_k, n) = 0$ for $k > h$. When $k = h$, however, $(1 - \tau_h)^n = (1 - 0)^n = 1$, and so this term in the summation of (2) does not go to zero. Instead, when $\tau_h = 0$ and $\tau_i > 0$ for $i < h$,

$$\lim_{n \to \infty} p(A_h, n) = \frac{\tau_0\tau_1\cdots\tau_{h-1}}{(\tau_0 - 0)(\tau_1 - 0)\cdots(\tau_{h-1} - 0)} = 1.$$

The recall score, $\rho_{n+1}$, then approaches zero as an asymptote; from (5),

$$\lim_{n \to \infty}\rho_{n+1} = \sum_{k=0}^{\infty}\tau_k\left[\lim_{n \to \infty} p(A_k, n)\right] = 0,$$

since the probability at the asymptote is concentrated at state $A_h$, and for this state $\tau_h = 0$. This case is of little interest for an acquisition theory, since the asymptote of the learning curve is at zero. Therefore, in what follows, we shall be concerned only with the case in which all the $\tau_k$ are different and greater than zero.

If all the transitional probabilities $\tau_k$ are greater than zero, then from (2) we see that as $n$ approaches infinity all the terms in the summation go toward zero for all finite values of $k$. Consequently the sum of the $p(A_k, n)$ can be made as near zero as we please for any finite $k$ by selecting a large enough value of $n$. In the limit, therefore, the probability of any finite number of recalls is zero. Since the sum of the $p(A_k, n)$ must equal unity, almost all the probability comes to be concentrated in state $A_\infty$ and we have for the limit when all $\tau_k > 0$,

$$\lim_{n \to \infty} p(A_\infty, n) = 1.$$
We are now able to show that a word in state $A_k$ has probability one of moving to state $A_{k+1}$, if the learning process is continued indefinitely. This happens because almost all words eventually reach state $A_\infty$. Thus we can write, for the probability of leaving state $A_k$ on some trial,

$$\sum_{n=k}^{\infty}\tau_k\,p(A_k, n) = 1,$$

or,

$$\sum_{n=k}^{\infty} p(A_k, n) = \frac{1}{\tau_k} \quad \text{for } \tau_k > 0.$$

In all the cases we shall consider in this paper the value of $\tau_k$ will approach an asymptote as $k \to \infty$. We are interested in placing the following restrictions on the $\tau_k$:

The first two conditions insure that $p(A_k, n)$ goes toward zero for finite $k$ and large $n$. The third condition provides the asymptotic value of $\tau_k$ for infinite $k$. In the summation for the limiting value of $\rho_{n+1}$, all terms are zero out to infinity, and so we have

$$\lim_{n \to \infty}\rho_{n+1} = \tau_\infty\,p(A_\infty, \infty) = m. \tag{5'}$$

In other words, if we assume that $m$ is the asymptotic value of $\tau_k$ as $k \to \infty$, then $m$ is also the asymptotic value of $\rho_{n+1}$ as $n \to \infty$.
In the special cases discussed below, a restriction is placed upon the value of $\tau_k$ in the form of the linear difference equation,*

$$\tau_{k+1} = a + \alpha\tau_k, \tag{6}$$

where $0 < a < 1$ and $0 \le \alpha \le (1 - a)$. The limits for $a$ have been chosen so that $\tau_{k+1}$ is bounded between zero and one and, since we are interested in acquisition, so that $\tau_{k+1} \ge \tau_k$.

*We have tried to observe the convention that parameters are represented by Greek letters and statistical estimates are represented by Roman letters. In the case of $a$ and $m$, however, we have violated this convention in order to make our symbols coincide with those used by other workers. The symbols $m$, $\alpha$, $a$, and $p$ were originally proposed by Bush and Mosteller.

Consider the following development of (5):

$$\rho_{n+2} = \sum_{k=0}^{n+1}\tau_k\,p(A_k, n + 1).$$

Now substitute for $p(A_k, n + 1)$ according to (1):

$$\rho_{n+2} = \sum_{k=0}^{n+1}\tau_k\,p(A_k, n)(1 - \tau_k) + \sum_{k=0}^{n+1}\tau_k\,p(A_{k-1}, n)\tau_{k-1}$$

$$= \rho_{n+1} - \sum_{k=0}^{n}\tau_k^2\,p(A_k, n) + \sum_{k=0}^{n}\tau_{k+1}\tau_k\,p(A_k, n).$$

Next we substitute for $\tau_{k+1}$ according to (6):

$$\rho_{n+2} = \rho_{n+1} - \sum_{k=0}^{n}\tau_k^2\,p(A_k, n) + \sum_{k=0}^{n}(a + \alpha\tau_k)\tau_k\,p(A_k, n)$$

$$= (1 + a)\rho_{n+1} - (1 - \alpha)\sum_{k=0}^{n}\tau_k^2\,p(A_k, n)$$

$$= (1 + a)\rho_{n+1} - (1 - \alpha)E(\tau_k^2, n + 1), \tag{7}$$

where $E(\tau_k^2, n + 1)$ is the second raw moment of the $\tau_k$ (as $\rho_{n+1}$ is the first raw moment) for trial $n + 1$.
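Relation (7) can be verified numerically for any transitional probabilities that obey the linear restriction (6). The sketch below is illustrative only: the starting value `tau0` and the constants `a` and `alpha` are arbitrary choices that keep every $\tau_k$ between zero and one, and the function names are hypothetical.

```python
def linear_tau(tau0, a, alpha, count):
    """tau_0 .. tau_{count-1} generated by tau_{k+1} = a + alpha * tau_k."""
    tau = [tau0]
    for _ in range(count - 1):
        tau.append(a + alpha * tau[-1])
    return tau

def moments_by_trial(tau, n_trials):
    """First and second raw moments of tau_k, weighted by p(A_k, n).

    Entry n of each list is the moment for trial n + 1; tau must supply
    at least n_trials + 1 values.
    """
    p = [1.0] + [0.0] * (len(tau) - 1)
    first, second = [], []
    for _ in range(n_trials):
        first.append(sum(t * pk for t, pk in zip(tau, p)))
        second.append(sum(t * t * pk for t, pk in zip(tau, p)))
        p = [p[k] * (1 - tau[k]) + (p[k - 1] * tau[k - 1] if k > 0 else 0.0)
             for k in range(len(tau))]
    return first, second

if __name__ == "__main__":
    a, alpha = 0.1, 0.8
    tau = linear_tau(0.2, a, alpha, 12)
    rho, m2 = moments_by_trial(tau, 10)
    for n in range(9):
        predicted = (1 + a) * rho[n] - (1 - alpha) * m2[n]    # equation (7)
        print(round(rho[n + 1], 6), round(predicted, 6))
```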

Restriction (6) brings the system into direct correspondence with a special case of the theory developed by Bush and Mosteller. In their terminology, an operator $Q_1$ is applied to the probability of response, $p$, to give $a_1 + \alpha_1 p$ as the new probability whenever a trial is successful. A second operator $Q_2$ is applied to give $a_2 + \alpha_2 p$ whenever a trial is unsuccessful. In the present application of this more general theory, $Q_1$ is preserved intact by restriction (6), but $Q_2$ is assumed to be the identity operator. That is to say, $a_2$ is zero and $\alpha_2$ is unity, so $Q_2 p = p$. In the present application, an unsuccessful trial consists of the omission of the word during recall. It seems reasonable to assume that the non-occurrence of a word has no effect upon its probability of occurrence on the next trial. How successful this simple assumption is will be seen when we examine the data.


Analysis of the Data

At the end of the experiment the experimenter has collected a set of word lists: the words recalled by the learner on successive trials. These recall lists will usually contain a small number of words that did not occur in the presentation. These spontaneous additions by the learner are of some interest in themselves, but we shall ignore them in the present discussion.

We would like to use the data contained in the word lists to obtain an estimate of $\rho_{n+1}$ in (5). We shall refer to the estimate as $r_{n+1}$. There are, we suppose, $N$ words provided by the experimenter as learning material in the experiment. It seems reasonable to assume that under certain conditions these words are homogeneous. By this we imply that the responses to all of the words in state $A_k$ may be considered as estimates of the same transitional probability of recall, $\tau_k$.
shall consider is

$$\tau_{k+1} = a + (1 - a)\tau_k. \tag{12}$$

In this form the model contains only the single parameter, $a$. The solution of the difference equation (12) is

$$\tau_k = 1 - (1 - a)^{k+1}. \tag{13}$$

The interpretation of (13) in set-theoretical terms runs as follows: On the first presentation of the list a random sample of elements is conditioned for each word. The measure of this sample is $a$, and it represents the probability, $\tau_0$, of going from state $A_0$ to state $A_1$. If a word is not recalled, no change is produced in the proportion of conditioned elements. When a word is recalled, however, the effect is to condition another random sample of elements, drawn independently of the first sample, of measure $a$ to that word. Since some of the elements sampled at recall will have been previously conditioned, after one recall we have (because of our assumption of independence between successive samples):

$$\underbrace{a}_{\text{elements conditioned during presentation}} + \underbrace{a}_{\text{elements conditioned during the recall}} - \underbrace{a^2}_{\text{common elements}} = 1 - (1 - a)^2 = \tau_1.$$

This quantity gives us the transitional probability $\tau_1$ of going from $A_1$ to $A_2$, from the first to the second recall. The second time a word is recalled another independent random sample of measure $a$ is drawn and conditioned, so we have

$$[1 - (1 - a)^2] + a - a[1 - (1 - a)^2] = 1 - (1 - a)^3 = \tau_2.$$

Continuing in this way generates the relation (13).

With this substitution the general difference equation (1) becomes

$$p(A_k, n + 1) = p(A_k, n)(1 - a)^{k+1} + p(A_{k-1}, n)[1 - (1 - a)^k].$$

The solution of this difference equation can be obtained by the general method, by the appropriate substitution for $\tau_k$ in (2). The solution is

$$p(A_0, n) = (1 - a)^n,$$

$$p(A_k, n) = (1 - a)^{n-k}\prod_{i=0}^{k-1}\left[1 - (1 - a)^{n-i}\right]. \tag{14}$$


From definition (5) it is possible to obtain the following recursive expression for the recall on trial $n + 1$ (see Appendix B):

$$\rho_{n+1} = a + (1 - a)[1 - (1 - a)^n]\rho_n. \tag{15}$$

The variance of the recall score, $r_{n+1}$, is

$$\operatorname{Var}(r_{n+1}) = \frac{1}{N}(\rho_{n+1} - \rho_{n+1}^2). \tag{16}$$
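The recursion (15) and the variance (16) are simple to tabulate. The sketch below assumes the readings $\rho_{n+1} = a + (1 - a)[1 - (1 - a)^n]\rho_n$ with $\rho_0 = 0$ and $\operatorname{Var}(r_{n+1}) = (\rho_{n+1} - \rho_{n+1}^2)/N$; the values $a = 0.22$ and $N = 64$ are those quoted in the text for the Bruner and Zimmerman subject, and the function names are hypothetical.

```python
from math import sqrt

def one_parameter_recall(a, n_trials):
    """rho_0 .. rho_{n_trials} from the recursion (15), with rho_0 = 0."""
    rho = [0.0]
    for n in range(n_trials):
        rho.append(a + (1 - a) * (1 - (1 - a)**n) * rho[n])
    return rho

def recall_sd(rho_value, n_words):
    """Standard deviation of the observed recall score, from (16)."""
    return sqrt((rho_value - rho_value**2) / n_words)

if __name__ == "__main__":
    rho = one_parameter_recall(0.22, 32)
    for trial in (1, 2, 5, 10, 20, 32):
        band = recall_sd(rho[trial], 64)
        print(trial, round(rho[trial], 3), round(band, 3))
```

The printed band is the half-width of the dotted envelope that the text describes around the theoretical curve.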

In order to illustrate the application of these equations, we have taken the data from one subject in an experiment by J. S. Bruner and C. Zimmerman (unpublished). In their experiment a list of 64 monosyllabic English words was read aloud to the subject. At the end of each reading the subject wrote all of the words he could remember. The order of the words was scrambled before each reading. A total of 32 presentations of the list was given.

From the detailed analysis of the estimates of $\tau_k$ derived from this subject's data it was determined that a value of $a = 0.22$ would provide a good fit. In Figure 1 the values of $\rho_{n+1}$ computed from (15) are given by the solid function. The data are shown by the open circles. The dotted lines are drawn one standard deviation from $\rho_{n+1}$ as computed from the variance in (16). The single parameter gives a reasonably adequate description of these data, at least through the first 20 trials. From the 20th trial on, however, it seems that the subject "forgets as fast as he learns." He seems to reach an asymptote somewhat below the theoretical value at unity. The introduction of an asymptote less than unity will be discussed in connection with the three-parameter case.

FIGURE 1. Comparison of Theoretical and Observed Values of $\rho_n$ for the One-Parameter Case (64 monosyllables, $a = 0.22$; per cent recalled plotted against trial number). Dotted line is drawn ± one standard deviation from $\rho_n$.


As a further check on the correspondence of theory and data, Figure 2 shows the predicted and observed values of $p(A_k, n)$ as a function of $n$, for $k = 0, 1, 2, 3$.

FIGURE 2. Comparison of Theoretical and Observed Values of $p(A_k, n)$ for the One-Parameter Case (per cent in state $A_k$ plotted against trial number).

Second Case: Two Parameters. 


In the one-parameter form of the theory it is assumed that the proportion of elements sampled during the presentation of the list is the same as the proportion sampled during each recall. Most data are not adequately described by such a simple model. At the very least, then, it is necessary to consider the situation when these two sampling constants are different. In order to introduce the second parameter, we phrase restriction (6) in the following form:

$$\tau_0 = p_0,$$
$$\tau_{k+1} = a + (1 - a)\tau_k, \tag{17}$$

where $p_0$ is the proportion of elements conditioned during the presentation. The solution of this difference equation can be written

$$\tau_k = 1 - (1 - p_0)(1 - a)^k. \tag{18}$$

On the first presentation of the list a random sample of measure $p_0$ is conditioned to every word. When a word is recalled, a random sample of measure $a$ is drawn and conditioned. After one recall, therefore, the measure of conditioned elements is

$$\tau_1 = p_0 + a - ap_0 = 1 - (1 - p_0)(1 - a).$$

After two recalls the measure of conditioned elements is

$$[1 - (1 - p_0)(1 - a)] + a - a[1 - (1 - p_0)(1 - a)] = 1 - (1 - p_0)(1 - a)^2.$$

Continuing in this way generates the relation (18).

With this substitution the general difference equation (1) becomes

$$p(A_k, n + 1) = p(A_k, n)(1 - p_0)(1 - a)^k + p(A_{k-1}, n)[1 - (1 - p_0)(1 - a)^{k-1}]. \tag{19}$$

The solution of (19) is

$$p(A_0, n) = (1 - p_0)^n,$$

$$p(A_k, n) = (1 - p_0)^{n-k}\prod_{i=0}^{k-1}\frac{[1 - (1 - p_0)(1 - a)^i][1 - (1 - a)^{n-i}]}{1 - (1 - a)^{i+1}}. \tag{20}$$

When $p_0 = a$, (20) reduces to (14).

The recursive form for the recall now becomes (see Appendix B)

$$\rho_{n+1} = p_0 + (1 - p_0)[1 - (1 - a)^n]\rho_n. \tag{21}$$

The variance of $r_{n+1}$ is

$$\operatorname{Var}(r_{n+1}) = \frac{1}{N}(\rho_{n+1} - \rho_{n+1}^2). \tag{22}$$
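The two-parameter expressions can be cross-checked against each other. The sketch below assumes the product form of (20) and the recursion (21) as read above, and verifies that the recall score from (21) equals the weighted sum $\sum_k \tau_k\,p(A_k, n)$ of definition (5) with $\tau_k$ from (18). The parameter values $a = 0.10$ and $p_0 = 0.27$ are those quoted in the text for the 32-word list; the function names are hypothetical.

```python
def two_parameter_state_prob(k, n, a, p0):
    """p(A_k, n) from the product form (20)."""
    if k > n:
        return 0.0
    value = (1 - p0)**(n - k)
    for i in range(k):
        value *= (1 - (1 - p0) * (1 - a)**i) * (1 - (1 - a)**(n - i))
        value /= 1 - (1 - a)**(i + 1)
    return value

def two_parameter_recall(a, p0, n_trials):
    """rho_0 .. rho_{n_trials} from the recursion (21), with rho_0 = 0."""
    rho = [0.0]
    for n in range(n_trials):
        rho.append(p0 + (1 - p0) * (1 - (1 - a)**n) * rho[n])
    return rho

def recall_from_states(n, a, p0):
    """rho_{n+1} as the sum of tau_k p(A_k, n), using (18) and (20)."""
    return sum((1 - (1 - p0) * (1 - a)**k) * two_parameter_state_prob(k, n, a, p0)
               for k in range(n + 1))

if __name__ == "__main__":
    a, p0 = 0.10, 0.27
    rho = two_parameter_recall(a, p0, 32)
    for n in (0, 4, 9, 19, 31):
        print(n + 1, round(rho[n + 1], 4), round(recall_from_states(n, a, p0), 4))
```

Setting `p0` equal to `a` reproduces the one-parameter case.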

In order to illustrate the application of these equations we have selected two sets of data. The first set was collected by Bruner and Zimmerman. A list of 32 monosyllabic words was read aloud. At the end of each reading the subject wrote all of the words he could remember. The order of the words was scrambled before every reading. A total of 32 presentations of the list was given.

From the analysis of the $t_k$ calculated for this particular subject it was found that $a = 0.10$ and $p_0 = 0.27$ gave a good description of the data. In Figure 3 the values of $\rho_{n+1}$ computed from (21) are shown by the solid function. The data are given by the open circles. The dotted lines are drawn one standard deviation from $\rho_{n+1}$ as computed from (22). As a further check, Figure 4 shows the predicted and observed values of $p(A_k, n)$ as a function of $n$ for $k = 0, 1, 2, 3$.

The distribution of cumulative recalls on any given trial provides still 
another way of viewing the data. In Figure 5, the cumulative distribution 
of k, the number of recalls, is shown for trials 5, 10, 15, 20. The proportion 
of test words recalled k times or less is plotted for comparison on each trial. 

The second set of data was collected by M. Levine. He read aloud a 
100-word anecdote. At the end of the reading, the subject wrote down all 
he could remember. Four such trials were given. The order of the words 
was not scrambled during the interval between trials. 

From the analysis of the data for this particular subject it was found
that $a = 0.87$ and $p_0 = 0.61$ gave a good description of the results. Figure 6
shows the comparison of theory and experiment both for $\rho_{n+1}$ and for
$p(A_k, n)$ for $k = 0, 1, 2$.

As a general observation, we have noted that when the order of the words 
is not scrambled between trials, the parameter a is relatively large. This 
is to say, when the words are not scrambled, there is a much higher probability 
that the same words will be recalled on successive trials. This effect is related
to the serial-position curve. The subject recalls words at the beginning and 
at the end of the list. If these words remain in their favored positions, they 
continue to be recalled. New words are added to those recalled at the ends 
at a rate determined by $p_0$, so the learning works from the two ends toward
the middle, which is the last to be learned. This effect has been noted with 
lists of randomly selected English words as well as with anecdotes. 


Third Case: Three Parameters 


In the one- and two-parameter cases we have assumed that after sufficient
practice the subject should eventually reach perfect performance. Some data,
however, seem to evade this simple assumption and so it is necessary to
consider what happens when a lower asymptote is introduced. Such a parameter
may be necessary when, for example, the period of time allowed for recall is
limited.


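To introduce the third parameter we adopt the general restriction (6)

$$\tau_0 = p_0, \qquad \tau_{k+1} = a + \alpha\tau_k, \qquad \text{where } 0 \le a \le 1 - \alpha \le 1. \tag{23}$$

The solution of (23) can be written

$$\tau_k = \frac{a}{1 - \alpha} - \left(\frac{a}{1 - \alpha} - p_0\right)\alpha^k. \tag{24}$$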


When $\alpha = 1 - a$, (24) reduces to (18). From (24) we see that as $k$ increases
without limit, $\tau_k$ approaches $a/(1 - \alpha)$ as an asymptote. From (5') we know
that $\tau_k$ and $\rho_{n+1}$ approach the same asymptotic value, $m$. So we have the
equation

$$\lim_{n \to \infty} \rho_n = m = \frac{a}{1 - \alpha}. \tag{25}$$

Since $1 - \alpha \ge a$, $m$ cannot exceed unity; and since both $a \ge 0$ and
$1 - \alpha > 0$, $m$ cannot be negative. In general, we are interested in cases where
$m > p_0$, for if $p_0 > m$, we obtain forgetting rather than acquisition.

[Figure: Theoretical and Observed Values of $\rho_n$ for a Two-Parameter Case.
Dotted line is drawn ± one standard deviation from $\rho_n$.]

A set-theoretical rationalization for (24) runs as follows. On the
presentation of the material a random sample of elements of measure $p_0$ is
conditioned for every word. At the first recall a sample of measure $1 - \alpha$
is drawn. Of these elements, a portion of measure $a$ is conditioned and the
remainder, $1 - \alpha - a$, are extinguished. We add the conditioned elements
as before, but now we must subtract the measure of the elements conditioned
during presentation and extinguished during recall, i.e., $(1 - \alpha - a)p_0$. Thus
we have

$$\tau_1 = p_0 + a - a p_0 - (1 - \alpha - a)p_0 = a + \alpha p_0 = m - (m - p_0)\alpha.$$

At the second recall the same sampling procedure is repeated:

$$\tau_2 = \tau_1 + a - a\tau_1 - (1 - \alpha - a)\tau_1 = a + \alpha\tau_1 = m - (m - p_0)\alpha^2.$$

Continuing in this way generates the relation (24).
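The relation between the recursion (23), the closed form (24), and the asymptote (25) can be sketched in a few lines. This is only an illustration under assumed parameter values (they happen to be the ones used in the numerical example of this section); the function names are hypothetical.

```python
def tau_recursive(k, p0, a, alpha):
    """Apply restriction (23) k times: tau_{k+1} = a + alpha * tau_k."""
    t = p0
    for _ in range(k):
        t = a + alpha * t
    return t


def tau_closed(k, p0, a, alpha):
    """Closed form (24): tau_k = m - (m - p0) * alpha**k, with m = a/(1-alpha)."""
    m = a / (1.0 - alpha)
    return m - (m - p0) * alpha ** k


p0, alpha, m = 0.13, 0.83, 0.7      # illustrative values
a = m * (1.0 - alpha)               # from m = a / (1 - alpha)

gaps = [abs(tau_recursive(k, p0, a, alpha) - tau_closed(k, p0, a, alpha))
        for k in range(30)]
far_out = tau_closed(500, p0, a, alpha)   # should be essentially m

# Setting alpha = 1 - a recovers the two-parameter form (18).
a2 = 0.10
two_param = 1.0 - (1.0 - p0) * (1.0 - a2) ** 5
reduced = tau_closed(5, p0, a2, 1.0 - a2)
```

Because $m - p_0$ is multiplied by $\alpha^k$, the approach to the asymptote is geometric; a value of $\alpha$ close to 1 means many recalls are needed before $\tau_k$ is near $m$.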

When (24) is substituted into (1), we obtain the appropriate difference 
equation, but its solution for the three-parameter case is hardly less cumber- 
some than (2). It would appear that the simplest way to work with these 
equations is to take advantage of our solution of the two-parameter case. 

First, we introduce a new transitional probability, $\tau'_k$, such that

$$\tau'_k = \tau_k / m = 1 - (1 - p_0/m)\alpha^k, \qquad \text{for } p_0 < m. \tag{26}$$

This new variable is now the same as in the case of two parameters given in
(18), with substitution of $p_0/m$ for $p_0$ and $\alpha$ for $(1 - a)$. Therefore, from
(2) and (20), we know that

$$(1 - p_0/m)^{n-k} \prod_{i=0}^{k-1} [1 - (1 - p_0/m)\alpha^i] \sum_{i=0}^{k} \frac{\alpha^{in}}{\prod_{j=0,\, j \ne i}^{k} (\alpha^i - \alpha^j)} = p'(A_k, n). \tag{27}$$

When $m\tau'_i$ is substituted into (2), the factor $m^k$ in the product in front
of the summation cancels the factor $m^k$ in the denominator under the
summation. Thus we know that

$$p(A_k, n) = \tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{(1 - m\tau'_i)^n}{\prod_{j=0,\, j \ne i}^{k} (\tau'_j - \tau'_i)}, \tag{28}$$

which is the same as $p'(A_k, n)$ in (27) except for the numerator under the
summation. This numerator can be written

$$(1 - m\tau'_i)^n = [(1 - m) + m(1 - \tau'_i)]^n$$

$$= (1 - m)^n + n m (1 - m)^{n-1}(1 - \tau'_i) + \cdots + \binom{n}{j} m^j (1 - m)^{n-j}(1 - \tau'_i)^j + \cdots + m^n (1 - \tau'_i)^n. \tag{29}$$


Now we substitute this sequence for the numerator in (28) and sum term by
term. When we consider the last term of this sequence we have

$$\tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{m^n (1 - \tau'_i)^n}{\prod_{j=0,\, j \ne i}^{k} (\tau'_j - \tau'_i)},$$

which we know from (27) is equal to $m^n p'(A_k, n)$. The next to last term
gives

$$\tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{n(1 - m)m^{n-1}(1 - \tau'_i)^{n-1}}{\prod_{j=0,\, j \ne i}^{k} (\tau'_j - \tau'_i)},$$

which we know from (27) is equal to $n(1 - m)m^{n-1} p'(A_k, n - 1)$. Proceeding
in this manner brings us eventually to the case where $n < k$, and then
we know the term is zero. Consequently, we can write

$$p(A_k, n) = m^n p'(A_k, n) + n(1 - m)m^{n-1} p'(A_k, n - 1) + \cdots + \binom{n}{k}(1 - m)^{n-k} m^k p'(A_k, k)$$

$$= \sum_{i=k}^{n} \binom{n}{i} m^i (1 - m)^{n-i} p'(A_k, i). \tag{30}$$
tik 


When the asymptote is unity ($m = 1$), (29) and (30) reduce to the
two-parameter case.

We recall that because of the way in which our probabilities were
defined in (1), (30) can be written as

$$p(A_k, n) = \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} p'(A_k, i).$$

Now it is not difficult to find an expression for $\rho_{n+1}$ in terms of the
$\rho'_{i+1}$ computed in the two-parameter case:

$$\rho_{n+1} = \sum_{k=0}^{n} \tau_k\, p(A_k, n) = m \sum_{k=0}^{n} \tau'_k\, p(A_k, n) = m \sum_{k=0}^{n} \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} \tau'_k\, p'(A_k, i).$$

If we invert the order of summation, we find that

$$\rho_{n+1} = m \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} \sum_{k=0}^{n} \tau'_k\, p'(A_k, i) = m \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} \rho'_{i+1}. \tag{31}$$

The computation of $\rho_{n+1}$ by this method involves two steps: first, the values
of $\rho'_{i+1}$ are calculated as in the two-parameter case with the substitution
indicated in (26); second, these values of $\rho'_{i+1}$ are weighted by the binomial
expansion of $[m + (1 - m)]^n$ and then summed according to (31).
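The two-step computation just described can be sketched as follows. This is a hedged illustration, not the authors' worksheet: it iterates the general difference equation (1) once for the three-parameter probabilities $\tau_k = m\tau'_k$ and once for the normalized probabilities $\tau'_k$ of (26), then checks that the binomial weighting in (30) and (31) reproduces the direct results. All function names and the parameter values are assumptions chosen for illustration.

```python
from math import comb


def state_distribution(n_trials, tau):
    """Iterate equation (1) from d_0 = {1, 0, 0, ...}.

    tau(k) is the recall probability in state A_k; dists[n][k] = p(A_k, n).
    """
    dist = [1.0]
    dists = [dist]
    for _ in range(n_trials):
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - tau(k))
            new[k + 1] += prob * tau(k)
        dist = new
        dists.append(dist)
    return dists


p0, alpha, m = 0.13, 0.83, 0.7           # illustrative values


def tau_prime(k):
    """Normalized transitional probability (26)."""
    return 1.0 - (1.0 - p0 / m) * alpha ** k


def tau(k):
    """Three-parameter transitional probability, tau_k = m * tau'_k."""
    return m * tau_prime(k)


n_max = 10
direct = state_distribution(n_max, tau)          # p(A_k, n)
primed = state_distribution(n_max, tau_prime)    # p'(A_k, i)


def weight(n, i):
    """Binomial weight C(n, i) m^i (1 - m)^(n - i)."""
    return comb(n, i) * m ** i * (1.0 - m) ** (n - i)


def mixture_state(k, n):
    """Right side of (30); p'(A_k, i) vanishes for i < k."""
    return sum(weight(n, i) * primed[i][k] for i in range(k, n + 1))


# rho'_{i+1} from the primed chain, then rho_{n+1} from (31)
rho_prime = [sum(tau_prime(k) * p for k, p in enumerate(d)) for d in primed]
rho_mix = [m * sum(weight(n, i) * rho_prime[i] for i in range(n + 1))
           for n in range(n_max + 1)]
rho_direct = [sum(tau(k) * p for k, p in enumerate(d)) for d in direct]

state_gap = max(abs(direct[n][k] - mixture_state(k, n))
                for n in range(n_max + 1) for k in range(n + 1))
rho_gap = max(abs(x - y) for x, y in zip(rho_mix, rho_direct))
```

One way to read (30): on each trial the normalized chain advances one step with probability $m$ and stands still otherwise, so after $n$ trials it has taken a binomially distributed number of steps.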

These computations can be abbreviated somewhat by using an approximation
developed by Bush and Mosteller (personal communication). It is

$$\rho_{n+2} = (2 + a + 2a\alpha)\rho_{n+1} - [a^2(1 - \alpha) + (1 + a)(1 + 2a\alpha)]\rho_n$$

$$\qquad + 3(1 + a)(1 - \alpha^2)\rho_n^2 - 2(1 - \alpha)(1 - \alpha^2)\rho_n^3 - 3(1 - \alpha^2)\rho_n\rho_{n+1}. \tag{32}$$

The approximation involves permitting the third moment of the distribution
of the $\tau_k$ around $\rho_n$ to go to zero on every trial.
The variance of $r_{n+1}$ in the three-parameter case is

$$\mathrm{Var}(r_{n+1}) = \frac{1}{N(1 - \alpha)}\,[\rho_{n+2} - (a + \alpha)\rho_{n+1}]. \tag{33}$$

This expression for the variance of $r_{n+1}$ follows directly from (7) and (10).
It is easily seen that (10) can be written as follows:

$$\sum_{k=0}^{n} \tau_k^2\, p(A_k, n) = \rho_{n+1} - N\,\mathrm{Var}(r_{n+1}). \tag{34}$$

Substituting (34) in (7) and solving for $\mathrm{Var}(r_{n+1})$ we find that

$$\mathrm{Var}(r_{n+1}) = \frac{1}{N(1 - \alpha)}\,[\rho_{n+2} - (a + \alpha)\rho_{n+1}],$$

which, except for notation, is (33). The one-parameter and two-parameter
variances (16) and (22) are special cases of this expression.

It is of interest to observe that when the limiting value, $m$, is substituted
in (33) for $\rho_{n+2}$ and $\rho_{n+1}$, the limiting variance is found to be binomial. That
is,

$$\lim_{n \to \infty} \mathrm{Var}(r_{n+1}) = \frac{m(1 - m)}{N}.$$

This reflects the fact, established earlier in (5'), that as $n$ grows very large
the variance of the $\tau_k$ around $m$ goes to zero.
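The algebra behind the binomial limit is short enough to write out. The following worked step is a sketch that uses only (33) and the definition $m = a/(1 - \alpha)$ from (25), so that $a = m(1 - \alpha)$:

```latex
\begin{aligned}
\lim_{n \to \infty} \mathrm{Var}(r_{n+1})
  &= \frac{1}{N(1 - \alpha)}\,\bigl[m - (a + \alpha)m\bigr]
   = \frac{m\,(1 - \alpha - a)}{N(1 - \alpha)} \\
  &= \frac{m\,\bigl[(1 - \alpha) - m(1 - \alpha)\bigr]}{N(1 - \alpha)}
   = \frac{m(1 - m)}{N}.
\end{aligned}
```

Setting $\alpha = 1 - a$ in (33) gives $a + \alpha = 1$ and $1 - \alpha = a$, which is how the two-parameter variance (22) arises as a special case.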

In order to obtain a numerical example, we have taken the data from
another subject in the experiment by Bruner and Zimmerman. Sixty-four
monosyllabic English words were read aloud and the order of the words was
scrambled before every presentation. A visual inspection of the data led us
to choose an asymptote in the neighborhood of 0.7. This asymptote is drawn
on the plot of the $t_k$ in Figure 7 and on the plot of the $r_n$ in Figure 8. Then we
estimated $p_0 = 0.13$ by considering all the trials on which words were in
state $A_0$ and calculating $p_0$ as the weighted average of the $t_{0,n+1}$ for all those
trials. Next we estimated the sampling parameter $\alpha = 0.83$. This was
done by obtaining the estimates, $t_k$, for successive values of $k$; these estimates,
together with (24), give us a set of equations estimating $\alpha$. We used the
weighted average of these estimates (ignoring negative values). Then we
obtained $a = 0.12$ from the equation $a = m(1 - \alpha)$. We shall comment on
the estimation problems later.

[Figure 7. Transitional Probability of Recall, $\tau_k$, as a Function of Number
of Recalls, for the Three-Parameter Case. Values of $t_k$ are indicated by open
circles. The function is $\tau_k = 0.7 - 0.57(0.83)^k$. Axes: transitional
probability, $\tau_k$, against cumulative number of recalls, $k$.]

When these parameter values were substituted into (24) we obtained the
function for $\tau_k$ shown in Figure 7. When the values were substituted into (28)
for $k = 1, 2, 3, 4$, we obtained the functions for $p(A_k, n)$ shown in Figure 9.
When they were substituted into (31) we obtained the function for $\rho_n$ shown
in Figure 8. In Figure 8 the dotted lines are drawn ± one standard deviation
from $\rho_n$, as computed from (33).

A comparison of the values of $\rho_n$ computed from (31) and from (32) is
given for the first eighteen trials in Table 1. With this choice of parameters
the Bush-Mosteller approximation seems highly satisfactory.


TABLE 1
Comparison of Exact and Approximate Values of $\rho_n$ for First 18 Trials

Trial   Exact   Approximate     Trial   Exact   Approximate
  1     .1300     .1300           10    .2663     .2655
  2     .1426     .1426           11    .2837     .2827
  3     .1559     .1559           12    .3014     .3000
  4     .1700     .1700           13    .3191     .3174
  5     .1847     .1846           14    .3369     .3347
  6     .2000     .1990           15    .3546     .3520
  7     .2159     .2157           16    .3722     .3692
  8     .2323     .2319           17    .3896     .3862
  9     .2491     .2486           18    .4067     .4030
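A comparison of this kind can be reproduced with a short program. The sketch below is an assumption-laden illustration rather than the original computation: the exact values come from iterating equation (1) directly (which is equivalent to (31)), the approximate values from the recursion (32), and $a$ is taken as $m(1 - \alpha) = 0.119$ before rounding. The starting value $\rho_2 = (1 + a)p_0 - (1 - \alpha)p_0^2$ follows from the fact that every word is in state $A_0$ before the first trial.

```python
def exact_recall(n_terms, p0, a, alpha):
    """Return [rho_1, ..., rho_n] by iterating the difference equation (1)."""
    dist = [1.0]              # p(A_k, 0): every word starts in A_0
    tau_k = [p0]              # tau_0 = p0, then tau_{k+1} = a + alpha * tau_k
    rho = []
    for _ in range(n_terms):
        while len(tau_k) < len(dist):
            tau_k.append(a + alpha * tau_k[-1])
        rho.append(sum(t * p for t, p in zip(tau_k, dist)))
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - tau_k[k])
            new[k + 1] += prob * tau_k[k]
        dist = new
    return rho


def approx_recall(n_terms, p0, a, alpha):
    """Return [rho_1, ..., rho_n] from the approximate recursion (32)."""
    rho = [p0, (1.0 + a) * p0 - (1.0 - alpha) * p0 ** 2]
    c = 1.0 - alpha ** 2
    while len(rho) < n_terms:
        r0, r1 = rho[-2], rho[-1]
        rho.append((2.0 + a + 2.0 * a * alpha) * r1
                   - (a * a * (1.0 - alpha)
                      + (1.0 + a) * (1.0 + 2.0 * a * alpha)) * r0
                   + 3.0 * (1.0 + a) * c * r0 ** 2
                   - 2.0 * (1.0 - alpha) * c * r0 ** 3
                   - 3.0 * c * r0 * r1)
    return rho


p0, alpha, m = 0.13, 0.83, 0.7
a = m * (1.0 - alpha)         # 0.119, reported in the text as 0.12

exact = exact_recall(18, p0, a, alpha)
approx = approx_recall(18, p0, a, alpha)
for trial, (e, x) in enumerate(zip(exact, approx), start=1):
    print(f"{trial:2d}  {e:.4f}  {x:.4f}")
```

The first three approximate values coincide with the exact ones because the distribution of the $\tau_k$ is concentrated at a single point before the first trial, so its third moment about the mean really is zero there.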
Discussion 


In the preceding pages we have made the explicit assumption that the
several words being memorized simultaneously are independent, that
memorizing one word does not affect the probability of recalling another word
on the list. The assumption can be justified only by its mathematical
convenience, because the data uniformly contradict it. The learner's
introspective report is that groups of words go together to form associated
clusters, and this impression is supported in the data by the fact that many
pairs of words are recalled together or omitted together on successive trials.
If the theory is used to describe the behavior of 50 rats, independence is a
reasonable assumption. But when the theory describes the behavior of 50 words
in a list that a single subject must learn, independence is not a reasonable
assumption. It is important, therefore, to examine the consequences of
introducing covariance.


The difference between the independent and the dependent versions of
the theory can best be illustrated in terms of the set-theoretical interpretation
of the two-parameter case. Imagine that we have a large ledger with 1000
pages. The presentation of the list is equivalent to writing each of the words
at random on 100 pages. Thus $p_0 = 100/1000 = 0.1$. Now we select a page
at random. On this page we find written the words A, B, and C. These
are responses on the first trial. The rule is that each of these words must
be written on 50 pages selected at random. Thus $a = 50/1000 = 0.05$. With
the independent model we would first select 50 pages at random and make
sure that word A was written on all of them, then select 50 more pages
independently for B, and 50 more for C. With a dependent model, however,
we could simply make one selection of 50 pages at random and write all three
words, A, B, and C, on the same sample of 50 pages. Then whenever A was
recalled again it would be likely that B and C would also be recalled at the
same time.

The probability that a word will be recalled depends upon the measure
of the elements conditioned to it (the number of pages in the ledger on which
it is inscribed) and does not depend upon what other words are written on the
same pages. Therefore, the introduction of covariance in this way does not
change the theoretical recall, $\rho_{n+1}$. The only effect is to increase the variance
of the estimates of $\rho_{n+1}$. In other words, it is not surprising that the
equations give a fair description of the recall scores even though no attention
was paid to the probabilities of joint occurrences of pairs of words.
Associative clustering should affect the variability, not the rate, of
memorization.
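The claim that clustering changes the variance but not the mean can be illustrated with an exact calculation on a single draw of a ledger page. The sketch below is a deliberately extreme simplification and not part of the original analysis: it compares three words written on pages chosen independently with three words written on exactly the same pages, each word appearing on a fraction 0.1 of the pages. The function names are hypothetical.

```python
from math import comb


def moments(dist):
    """Mean and variance of a discrete distribution given as (value, prob) pairs."""
    mean = sum(x * w for x, w in dist)
    var = sum((x - mean) ** 2 * w for x, w in dist)
    return mean, var


def independent_recall(n_words, p):
    """Proportion recalled when each word is on the drawn page independently."""
    return [(j / n_words, comb(n_words, j) * p ** j * (1.0 - p) ** (n_words - j))
            for j in range(n_words + 1)]


def clustered_recall(p):
    """Proportion recalled when all words share the same pages (assumed extreme)."""
    return [(0.0, 1.0 - p), (1.0, p)]


n_words, p = 3, 100 / 1000          # words A, B, C; each on 100 of 1000 pages

mean_ind, var_ind = moments(independent_recall(n_words, p))
mean_clu, var_clu = moments(clustered_recall(p))
```

Under these assumptions both versions have mean recall $p$, while the variance rises from $p(1 - p)/N$ in the independent case to $p(1 - p)$ in the fully clustered case.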

The parameters $a$, $p_0$, and $\alpha$ obtained from the linear difference
equation (6), are assumed to describe each word in the list. Thus data from
different words may be combined to estimate the various $\tau_k$. If the
parameters vary from word to word, $\rho_{n+1}$ is only an approximation of the mean
probability of recall determined by averaging the recall probabilities of all
the words. Similarly, the expressions given for $\rho_{n+1}$ cannot be expected to
describe the result of averaging several subjects' data together unless all
subjects are known to have the same values of the parameters.

The general theory, of course, is not limited to linear restrictions of
the form of (6). The data or the theory may force us to consider more
complicated functions for $\tau_k$. For all such cases the general solution (2) is
applicable, though tedious to use, and will enable us to compute the necessary
values of $p(A_k, n)$.

Once a descriptive model of this sort has been used to tease out the 
necessary parameters, the next step is to vary the experimental conditions 
and to observe the effects upon these parameters. In order to take this next 
step, however, we need efficient methods of estimating the parameters from 
the data. As yet we have found no satisfactory answers to the estimation 
problem.

There is a sizeable amount of computation involved in determining the
functions $p(A_k, n)$ and $\rho_n$. If a poor choice of the parameters $a$, $p_0$, and
$\alpha$ is made at the outset, it takes several hours to discover the fact. In the
example in the preceding section, we estimated the parameters successively
and used different parts of the data for the different estimates. After $\rho_n$
had been computed it seemed to us that our estimates of $p_0$ and $m$ were both
too low. Clearly, the method we have used to fit the theory to the data is
not a particularly good one. We have considered least squares in order to
use all of the data to estimate all parameters simultaneously. We convinced
ourselves that the problem was beyond our abilities. Consequently, we must
leave the estimation problem with the pious hope that it will appeal to
someone with the mathematical competence to solve it.
Appendix A 
Solution for $p(A_k, n)$ in the General Case


The solution of equation (1) with the boundary conditions we have 
enumerated has been obtained several times in the past (4, 5). We present 


below our own method of solution because the procedures involved may be 
of interest in other applications. 


Equation (1) may be written explicitly as follows:

$$(1 - \tau_0)\,p(A_0, n) = p(A_0, n+1)$$
$$\tau_0\, p(A_0, n) + (1 - \tau_1)\,p(A_1, n) = p(A_1, n+1)$$
$$\tau_1\, p(A_1, n) + (1 - \tau_2)\,p(A_2, n) = p(A_2, n+1)$$
$$\vdots$$

This system of equations can be written in matrix notation as follows:

$$\begin{pmatrix} 1 - \tau_0 & 0 & 0 & 0 & \cdots \\ \tau_0 & 1 - \tau_1 & 0 & 0 & \cdots \\ 0 & \tau_1 & 1 - \tau_2 & 0 & \cdots \\ 0 & 0 & \tau_2 & 1 - \tau_3 & \cdots \\ \vdots & & & & \end{pmatrix} \begin{pmatrix} p(A_0, n) \\ p(A_1, n) \\ p(A_2, n) \\ p(A_3, n) \\ \vdots \end{pmatrix} = \begin{pmatrix} p(A_0, n+1) \\ p(A_1, n+1) \\ p(A_2, n+1) \\ p(A_3, n+1) \\ \vdots \end{pmatrix}$$

This infinite matrix of transitional probabilities we shall call $T$, and the
infinite column vectors made up of the state probabilities on trial $n$ and
$n + 1$ we shall call $d_n$ and $d_{n+1}$. So we can write

$$T d_n = d_{n+1}.$$

The initial distribution of state probabilities, $d_0$, is the infinite column vector
$\{1, 0, 0, 0, \cdots\}$. The state probabilities on trial one are then given by

$$T d_0 = d_1.$$

The state probabilities on trial two are given by

$$T d_1 = d_2;$$

so by substitution,

$$T d_1 = T(T d_0) = T^2 d_0 = d_2.$$

Continuing this procedure gives the general relation

$$T^n d_0 = d_n.$$

Therefore, the problem of determining $d_n$ can be equated to the problem of
determining $T^n$.
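Since $T$ is a semi-matrix, we know that it can be expressed as

$$T = S D S^{-1},$$

where $D$ is an infinite diagonal matrix with the same elements on its diagonal
as are on the main diagonal of $T$ (e.g., 2). The diagonal elements of $S$ are
arbitrary, so we let $S_{ii} = 1$. Now we can write $TS = SD$:

$$T \begin{pmatrix} 1 & 0 & 0 & \cdots \\ S_{21} & 1 & 0 & \cdots \\ S_{31} & S_{32} & 1 & \cdots \\ \vdots & & & \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & \cdots \\ S_{21} & 1 & 0 & \cdots \\ S_{31} & S_{32} & 1 & \cdots \\ \vdots & & & \end{pmatrix} \begin{pmatrix} 1 - \tau_0 & 0 & 0 & \cdots \\ 0 & 1 - \tau_1 & 0 & \cdots \\ 0 & 0 & 1 - \tau_2 & \cdots \\ \vdots & & & \end{pmatrix}$$

Now it is a simple matter to solve for $S_{ij}$ term by term. For example, to
solve for $S_{21}$, we construct (from row 2 and column 1) the equation

$$\tau_0 + (1 - \tau_1)S_{21} = S_{21}(1 - \tau_0),$$

which gives

$$S_{21} = \tau_0 / (\tau_1 - \tau_0).$$

To solve for $S_{31}$, we use the equation

$$\tau_1 S_{21} + (1 - \tau_2)S_{31} = S_{31}(1 - \tau_0),$$

$$S_{31} = \tau_1 S_{21} / (\tau_2 - \tau_0) = \tau_0\tau_1 / [(\tau_1 - \tau_0)(\tau_2 - \tau_0)].$$

Proceeding in this manner gives the necessary elements of $S$, and we have

$$S = \begin{pmatrix} 1 & 0 & 0 & 0 & \cdots \\ \dfrac{\tau_0}{\tau_1 - \tau_0} & 1 & 0 & 0 & \cdots \\ \dfrac{\tau_0\tau_1}{(\tau_1 - \tau_0)(\tau_2 - \tau_0)} & \dfrac{\tau_1}{\tau_2 - \tau_1} & 1 & 0 & \cdots \\ \dfrac{\tau_0\tau_1\tau_2}{(\tau_1 - \tau_0)(\tau_2 - \tau_0)(\tau_3 - \tau_0)} & \dfrac{\tau_1\tau_2}{(\tau_2 - \tau_1)(\tau_3 - \tau_1)} & \dfrac{\tau_2}{\tau_3 - \tau_2} & 1 & \cdots \\ \vdots & & & & \end{pmatrix}$$

The elements of $S^{-1}$ can be obtained term by term from the equation
$S S^{-1} = I$. For example, the element $S'_{21}$ of $S^{-1}$ is given by row two of $S$
times column one of $S^{-1}$: $\tau_0/(\tau_1 - \tau_0) + S'_{21} = 0$. Continuing in this way
we have

$$S^{-1} = \begin{pmatrix} 1 & 0 & 0 & 0 & \cdots \\ \dfrac{\tau_0}{\tau_0 - \tau_1} & 1 & 0 & 0 & \cdots \\ \dfrac{\tau_0\tau_1}{(\tau_0 - \tau_2)(\tau_1 - \tau_2)} & \dfrac{\tau_1}{\tau_1 - \tau_2} & 1 & 0 & \cdots \\ \dfrac{\tau_0\tau_1\tau_2}{(\tau_0 - \tau_3)(\tau_1 - \tau_3)(\tau_2 - \tau_3)} & \dfrac{\tau_1\tau_2}{(\tau_1 - \tau_3)(\tau_2 - \tau_3)} & \dfrac{\tau_2}{\tau_2 - \tau_3} & 1 & \cdots \\ \vdots & & & & \end{pmatrix}$$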

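These matrices permit a simple representation of the powers of the
matrix $T$. Thus,

$$T^2 = (S D S^{-1})(S D S^{-1}) = S D (S^{-1} S) D S^{-1} = S D^2 S^{-1},$$

and in general,

$$T^n = S D^n S^{-1}.$$

Since $D$ is a diagonal matrix, $D^n$ is obtained by taking the nth power of every
diagonal element. When this equation for $T^n$ is multiplied through, we obtain

$$T^n = \begin{pmatrix} (1 - \tau_0)^n & 0 & 0 & \cdots \\ \tau_0\left[\dfrac{(1 - \tau_0)^n}{\tau_1 - \tau_0} + \dfrac{(1 - \tau_1)^n}{\tau_0 - \tau_1}\right] & (1 - \tau_1)^n & 0 & \cdots \\ \tau_0\tau_1\left[\dfrac{(1 - \tau_0)^n}{(\tau_1 - \tau_0)(\tau_2 - \tau_0)} + \dfrac{(1 - \tau_1)^n}{(\tau_0 - \tau_1)(\tau_2 - \tau_1)} + \dfrac{(1 - \tau_2)^n}{(\tau_0 - \tau_2)(\tau_1 - \tau_2)}\right] & \tau_1\left[\dfrac{(1 - \tau_1)^n}{\tau_2 - \tau_1} + \dfrac{(1 - \tau_2)^n}{\tau_1 - \tau_2}\right] & (1 - \tau_2)^n & \cdots \\ \vdots & & & \end{pmatrix}$$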


Since $T^n d_0$ involves only the first column of $T^n$, it is not actually necessary
to obtain more than the first columns of $S^{-1}$ and of $T^n$. We have presented
the complete solution here, however. It can be seen from inspection of the
first column of $T^n$ that (2) is the general solution:

$$p(A_0, n) = (1 - \tau_0)^n, \qquad \text{for } k = 0,$$

$$p(A_k, n) = \tau_0\tau_1 \cdots \tau_{k-1} \sum_{i=0}^{k} \frac{(1 - \tau_i)^n}{\prod_{j=0,\, j \ne i}^{k} (\tau_j - \tau_i)}, \qquad \text{for } k > 0. \tag{2}$$

This general method of solution can be used for the special cases
considered in this paper, with the substitution of the appropriate values for $\tau_k$.
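As a numerical check on the general solution, the closed form (2) can be compared with repeated application of $T$ to $d_0$. The sketch below is only illustrative: the five values of $\tau_k$ are arbitrary assumptions, chosen to be distinct because (2) divides by the differences $\tau_j - \tau_i$, and the matrix is truncated to those five states, which leaves the retained states unaffected since $T$ is lower triangular.

```python
from math import prod


def closed_form(k, n, taus):
    """General solution (2) for p(A_k, n); the tau values must be distinct."""
    if k == 0:
        return (1.0 - taus[0]) ** n
    lead = prod(taus[:k])                      # tau_0 tau_1 ... tau_{k-1}
    total = 0.0
    for i in range(k + 1):
        denom = prod(taus[j] - taus[i] for j in range(k + 1) if j != i)
        total += (1.0 - taus[i]) ** n / denom
    return lead * total


def iterate(n, taus):
    """Apply T to d_0 = {1, 0, 0, ...} n times, keeping len(taus) states."""
    dist = [1.0] + [0.0] * (len(taus) - 1)
    for _ in range(n):
        new = [0.0] * len(dist)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - taus[k])
            if k + 1 < len(dist):
                new[k + 1] += prob * taus[k]
        dist = new
    return dist


taus = [0.20, 0.35, 0.50, 0.60, 0.75]          # arbitrary, distinct values
n = 6
by_iteration = iterate(n, taus)
by_formula = [closed_form(k, n, taus) for k in range(len(taus))]
```

When several $\tau_k$ are nearly equal the denominators in (2) become small and the sum loses accuracy in floating point, so direct iteration is the safer route for numerical work.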
Appendix B 
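Recursive Expression for $\rho_{n+1}$ in Two-Parameter Case

From (20) we obtain the recursive relation

$$p(A_{k+1}, n+1) = \frac{1 - (1 - a)^{n+1}}{1 - (1 - a)^{k+1}}\,[1 - (1 - p_0)(1 - a)^k]\,p(A_k, n).$$

Rearranging and summing, we have

$$\sum_{k=0}^{n} \left[\frac{1 - (1 - a)^{k+1}}{1 - (1 - a)^{n+1}}\right] p(A_{k+1}, n+1) = \sum_{k=0}^{n} [1 - (1 - p_0)(1 - a)^k]\,p(A_k, n).$$

The right side of this equation is, from (5) and (18), $\rho_{n+1}$. The left side can
be rewritten

$$\sum_{k=1}^{n+1} \left[\frac{1 - (1 - a)^k}{1 - (1 - a)^{n+1}}\right] p(A_k, n+1) = \rho_{n+1},$$

which becomes on trial $n$ (with $n \ge 1$):

$$\sum_{k=1}^{n} \left[\frac{1 - (1 - a)^k}{1 - (1 - a)^n}\right] p(A_k, n) = \rho_n.$$

We now have, by adding and subtracting $p(A_0, n)$,

$$\sum_{k=0}^{n} p(A_k, n) - \sum_{k=0}^{n} (1 - a)^k p(A_k, n) = [1 - (1 - a)^n]\rho_n,$$

$$1 - \sum_{k=0}^{n} (1 - a)^k p(A_k, n) = [1 - (1 - a)^n]\rho_n.$$

Now we know that

$$\rho_{n+1} = 1 - (1 - p_0) \sum_{k=0}^{n} (1 - a)^k p(A_k, n);$$

and so we obtain

$$\rho_{n+1} = 1 - (1 - p_0)\{1 - [1 - (1 - a)^n]\rho_n\}.$$

Rearranging terms gives

$$\rho_{n+1} = p_0 + (1 - p_0)[1 - (1 - a)^n]\rho_n, \tag{21}$$

which is the desired result.

From this result (15) is obtained directly by equating $p_0$ and $a$.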


Appendix C 
List of Symbols and Their Meanings 


$a$  parameter.
$A_k$  state that a word is in after being recalled $k$ times.
$\alpha$  parameter.
$d_n$  infinite column vector, having $p(A_k, n)$ as its elements.
$D$  infinite diagonal matrix similar to $T$.
$k$  number of times a word has been recalled.
$m$  asymptotic value of $\tau_k$ and $\rho_n$.
$n$  number of trial.
$N$  total number of test words to be learned.
$N_{k,n}$  number of words in state $A_k$ on trial $n$.
$p_0$  probability of recalling a word in state $A_0$.
$p(A_k, n)$  probability that a word will be in state $A_k$ on trial $n$.
$r_n$  observed recall score on trial $n$; estimate of $\rho_n$.
$\rho_n$  probability of recall on trial $n$.
$S_{ij}$  elements of $S$.
$S'_{ij}$  elements of $S^{-1}$.
$S$  infinite matrix used to transform $T$ into a similar diagonal matrix.
$t_k$  estimate of $\tau_k$.
$t_{k,n}$  observed fraction of words in state $A_k$ that are recalled on trial $n$.
$\tau_k$  probability of recalling a word in state $A_k$.
$T$  infinite matrix of transition probabilities $\tau_k$.
Var $(r_n)$  variance of the estimate of $\rho_n$.
[symbol illegible]  random variable equal to 1 or 0.


REFERENCES 
1. Bush, R. R., and Mosteller, Frederick. A linear operator model for learning. (Paper
   presented to the Institute for Mathematical Statistics, Boston, December 27, 1951.)
2. Cooke, R. G. Infinite matrices and sequence spaces. London: MacMillan, 1950.
3. Estes, W. K. Toward a statistical theory of learning. Psychol. Rev., 1950, 57, 94-107.
4. Feller, W. On the theory of stochastic processes with particular reference to
   applications. Proceedings of the Berkeley Symposium on Mathematical Statistics and
   Probability, 1949, 403-432.
5. Woodbury, M. A. On a probability distribution. Ann. math. Statist., 1949, 20,
   311-313.


Manuscript received 3/11/52 


ULTIMATE CHOICE BETWEEN TWO ATTRACTIVE GOALS:
PREDICTIONS FROM A MODEL*


FREDERICK MOSTELLER†
HARVARD UNIVERSITY
AND
MAURICE TATSUOKA
UNIVERSITY OF HAWAII


A mathematical model for two-choice behavior in situations where both
choices are desirable is discussed. According to the model, one or the other
choice is ultimately preferred, and a functional equation is given for the
fraction of the population ultimately preferring a given choice. The solution
depends upon the learning rates and upon the initial probabilities of the
choices. Several techniques for approximating the solution of this functional
equation are described. One of these leads to an explicit formula that gives
good accuracy. This solution can be generalized to the two-armed bandit
problem with partial reinforcement in each arm, or the equivalent T-maze
problem. Another suggests good ways to program the calculations for a
high-speed computer.


The immobility of Buridan’s ass, who starved to death between two 
haystacks, has always seemed unreasonable. No doubt the story was invented 
to mock an equilibrium theory of behavior. One expects that any such 
equilibrium in approach-approach situations will be unstable—one of the 
attractive goals will be chosen. In this paper some properties that flow from 
a mathematical model for repetitive approach-approach behavior are dis- 
cussed. In the model for behavior in these choice situations, an organism 
initially shifts its choices from one to another, but after a while settles upon 
a single choice. 

Thus in the early part of the learning the theoretical organism may give
some expression to the notion of an equilibrium by making different choices
on different trials, but eventually even this behavior vanishes for the single
organism. On the other hand, some organisms may ultimately choose one
goal and others another, so that a notion of equilibrium or balance could be
recaptured across a population of organisms. The quantitative aspects of a
model for such behavior are investigated. The model employed is one
discussed by Bush and Mosteller [1].

A simple situation will be discussed first, then the mathematical problem
encountered there will be related to the more complicated two-armed bandit
problem with partial reinforcement on each arm. Suppose that on each
trial of an infinite sequence an organism may respond (or choose) in one
of two ways. For purposes of exposition, specify the ways as R and L (for
right and left, say), so that for concreteness one can think of a rat choosing
the left-hand or right-hand side in a T-maze, or a person choosing the
left-hand or the right-hand button in a two-armed bandit situation. However,
R and L are intended to stand for a general pair of attractive objects or
responses, mutually exclusive and exhaustive, which lead to attractive
goals.

Suppose that on a given trial the probability of choosing R is $p$, and
that of choosing L is $1 - p$, where as usual $0 \le p \le 1$. If R is chosen, then the
probability of choosing R next time is increased to $\alpha_1 p + 1 - \alpha_1$, but if L
is chosen the probability of choosing R next is reduced to $\alpha_2 p$, where
$0 \le \alpha_1 \le 1$, $0 \le \alpha_2 \le 1$. The point is that when a reinforcing choice is made,
that choice has an increased probability of being chosen next time, and
both R and L are regarded as reinforcing. The asymmetry in the formulas
comes from the fact that the notation uses the probability of choosing R,
and not the probability of choosing the particular side chosen on each trial.
The operators used to change the probabilities are discussed by Bush and
Mosteller ([1], p. 154 ff).

Suppose the organism continues making the choices and that his
probabilities are adjusted after every trial according to the rules just given.
Then it can be shown that sooner or later the organism stops making one of the
choices and thereafter chooses only the other. An extreme example occurs
if both $\alpha_1$ and $\alpha_2$ are zero; then the organism chooses forever what he chooses
first (one-trial learning).
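The two operators and the trial-by-trial adjustment can be written out as a small simulation. This is a minimal sketch for illustration only: the function names, the seed, and the number of trials are assumptions, and the random number generator stands in for the organism's choice on each trial.

```python
import random


def update(p, chose_r, alpha1, alpha2):
    """Apply the operator for the side chosen; p is the probability of R."""
    if chose_r:
        return alpha1 * p + 1.0 - alpha1    # R chosen: p moves toward 1
    return alpha2 * p                       # L chosen: p moves toward 0


def run_trials(p, alpha1, alpha2, n_trials, rng):
    """Simulate n_trials choices; return the final p and the choice sequence."""
    choices = []
    for _ in range(n_trials):
        chose_r = rng.random() < p
        choices.append("R" if chose_r else "L")
        p = update(p, chose_r, alpha1, alpha2)
    return p, choices


# One-trial learning: with both alphas zero the first choice fixes p at 0 or 1.
final_p, choices = run_trials(0.5, 0.0, 0.0, 20, random.Random(1))

# A slower case: after many trials p is typically very close to 0 or to 1.
slow_p, slow_choices = run_trials(0.5, 0.8, 0.9, 500, random.Random(2))
```

In the one-trial case every entry of `choices` equals the first one, which is the extreme example mentioned above; with larger values of the $\alpha$'s the sequence may alternate for a while before settling.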

One mathematical problem is to discover the probability that the
organism eventually chooses R rather than L all the time. If he does choose R all
the time, then he is said to be "ultimately attracted by R," or R is "ultimately
attracting." The desired probability should be expressible as a function of
the initial probability $p$ and of the attractiveness coefficients $\alpha_1$ and $\alpha_2$ (the
smaller an $\alpha$, the more attractive the side). For convenience, this will be
called the simple approach-approach problem, in contrast to the more
complicated partial reinforcement problems.

Consider now as an example a T-maze experiment with paradise fish
described by Bush and Wilson [2]. On each trial of this experiment a fish
started at one end of a tank and swam to the other, where the left or right
side could be chosen. When the right-hand side was chosen, the fish was
rewarded on 75 percent of the trials. When the left side was chosen, the fish
was rewarded on 25 percent of the trials. The operation was to place the reward
on one side or the other every time. In one group a fish was able to see the
reward through a transparent divider when he chose the unrewarded side.
In the other group an opaque divider was used. The data from these groups
showed that the fish tended to stabilize on one side or the other.

Within the framework of the operators described earlier in this paper,
if $p$ is the probability of choosing the right-hand side on a given trial, and
if the right-hand side is chosen and rewarded, the new probability of choosing
the right-hand side might be expressed as $\alpha_1 p + 1 - \alpha_1$. If the left-hand side
were chosen and rewarded, the new probability of choosing the right might
be reduced to $\alpha_2 p$. The parallel with the previous descriptions is very close.

But suppose the side chosen is not rewarded. Then, essentially, three
possibilities exist.

(a) The side chosen is more likely to be chosen than it was before. The
explanation might be, for example, that the organism is building up a habit
pattern, or that he is secondarily reinforced for being in a place that earlier
was rewarding.

(b) The side chosen is less likely to be chosen than before. The
explanation might be, for example, that information has been received that
this side is not paying off.

Whatever the explanation may be, the models corresponding to (a)
and to (b) make quite different predictions. The model for (a) says that the
probability associated with the side chosen is always increased whether
reward is given or not. This ultimately implies, for the operators described
here, that one side is chosen every time, that is, that eventually the organism
stabilizes on one side. On the other hand, the model for (b) would imply
that the organism does not stabilize. To see this, suppose that an organism
is certain ($p = 1$) to choose the right-hand side; that is, he has stabilized
on the right. Then because of partial reinforcement he will experience some
nonrewarded trials [...] model (b) has reflecting barriers.

(c) The probability is unchanged by a nonreward; then everything
depends upon the rewarded trials.

In the experiment with paradise fish the data suggest model (a). In
this paper we shall deal with the type (a) model. On the basis of the model, 
we would like to know (in terms of the learning rates, the initial probabilities, 
and the probabilities of reward on the two sides) what fraction of the organisms 
will stabilize on a given side. 

Because the numerical problem has turned out to be rather trouble- 
some, and because the general problem has some interest as shown by previous 
work, we will sketch various solutions that have been tried. Each of them 
is time-consuming in its development and testing, so a research worker will 
want to know what ground has already been plowed. 


Previous Work 


To facilitate discussion of previous work on the simple approach-approach
problem, a functional equation for the probability that an organism is
ultimately attracted to R will be derived. Let $f(p_0; \alpha_1, \alpha_2)$ be the probability
that an infinite sequence of trials ends in choices of R. Here, $p_0$ is the initial
probability of choosing R. The transition rules are: if $p_n$ is the probability
of R on trial $n$, then the probability of R on the next trial is

$$p_{n+1} = \begin{cases} \alpha_1 p_n + 1 - \alpha_1, & \text{if R is chosen on trial } n, \\ \alpha_2 p_n, & \text{if L is chosen on trial } n. \end{cases} \tag{1}$$


In the sequel there is usually no advantage in referring to the trial number
associated with $p$, so the subscript on $p_0$ is dropped and $p$ stands for the
initial probability. Similarly it is always to be understood that the desired
function $f$ depends upon $\alpha_1$ and $\alpha_2$; so except when the full notation is needed,
the notation $f(p)$ will be used.

The quantity $f(p)$ may be composed of two parts, the parts
corresponding to the choice of R or of L on the initial trial. Assume that each
member of a large population has the same initial probability $p$ of choosing
R and is faced with the same simple approach-approach problem. Then, on
the first choice the fraction $p$ of the individuals choose R, and the new
probability of R is $\alpha_1 p + 1 - \alpha_1$ for any member of this group. This means
that in this group, the probability of being ultimately attracted by R is
$f(\alpha_1 p + 1 - \alpha_1)$. Consequently this group contributes the portion
$p\,f(\alpha_1 p + 1 - \alpha_1)$ to $f(p)$. In the same manner those organisms choosing L
first contribute $(1 - p)\,f(\alpha_2 p)$ to $f(p)$. Thus one derives the basic functional
equation for the simple approach-approach problem:

$$f(p) = p\,f(\alpha_1 p + 1 - \alpha_1) + (1 - p)\,f(\alpha_2 p). \tag{2}$$


The boundary conditions are $f(0) = 0$ and $f(1) = 1$. These conditions hold
because if $p = 0$, then L occurs, and the new probability for R is $\alpha_2 \cdot 0 = 0$.
Therefore L is always chosen. Similarly if $p = 1$, then R occurs, and the new
probability for R is $\alpha_1 \cdot 1 + 1 - \alpha_1 = 1$. Therefore R is always chosen.
Thus $f(0) = 0$ and $f(1) = 1$. These conditions for the function are needed
because without them (2) only determines $f$ to within a linear transformation.
Thus if a certain $f$ satisfies (2), direct substitution shows that $Af + B$ also
satisfies it ($A$ and $B$ are constants).

Equation (2) could have had four parts if we related the desired probability
to the four terms occurring after two trials, or more generally $2^n$
terms after $n$ trials. These equations are all equivalent, but they can all be
derived by successive applications of (2) to the $f$'s appearing on the
right-hand side.

The properties of $f(p)$ have been studied before by Bellman and by
Shapiro ([3], Parts II and III), and by Karlin [4] (cf. [1], p. 163 ff.). Since
not all of their results are readily accessible, those properties of $f(p)$ especially
useful here are given below.

i. Nature of the solution. Equation (2) has a unique, monotone, analytic
solution once the boundary conditions are given. With our boundary conditions
the solution is convex for $\alpha_1 > \alpha_2$, concave for $\alpha_1 < \alpha_2$. The
monotonicity is consistent with the probability interpretation given by the
learning model: for given $\alpha_1$ and $\alpha_2$, the larger the probability of choosing R
initially, the more likely that R is ultimately attracting.

ii. Solutions under special conditions. In what follows, suppose the
relevant boundary conditions $f(0) = 0$ and $f(1) = 1$ to hold. The special
conditions have to do with the values assumed by one or both of the $\alpha$'s.

(a) $\alpha_1 = \alpha_2 \neq 1$. The solution is $f(p) = p$, as implied by the fact that
$f(p)$ is both convex and concave and by the boundary conditions.

(b) $\alpha_1 = \alpha_2 = 1$. The function $f$ is not defined in our problem unless
$p = 1$ or $0$, because the probability of R never changes and no attraction
occurs.

(c) $\alpha_1 = 1$, $\alpha_2 \neq 1$. The occurrence of R leaves the probability of R
unchanged because $\alpha_1 p + 1 - \alpha_1 = p$, so the process can only move toward
choosing more L's unless $p = 1$. Thus $f(p; 1, \alpha_2) = 0$, $\alpha_2 \neq 1$, $p \neq 1$, and
$f(1; 1, \alpha_2) = 1$.

(d) $\alpha_2 = 1$, $\alpha_1 \neq 1$. Similarly $f(p; \alpha_1, 1) = 1$, $\alpha_1 \neq 1$, $p \neq 0$,
and $f(0; \alpha_1, 1) = 0$.

(e) $\alpha_1 = 0$. Here, the only way to be ultimately attracted to L is always
to choose L. The probability of the latter behavior is

$$g(p, \alpha_2) = (1 - p)(1 - \alpha_2 p)(1 - \alpha_2^2 p) \cdots = \prod_{i=0}^{\infty} (1 - \alpha_2^i p). \tag{3}$$

Therefore the probability of ultimate attraction by R is

$$f(p; 0, \alpha_2) = 1 - g(p, \alpha_2). \tag{4}$$

(f) $\alpha_2 = 0$. Here to be ultimately attracted by R is never to choose L.
In this case

$$\begin{aligned} f(p; \alpha_1, 0) &= \Pr\{\text{R is chosen on every trial}\} \\ &= p\,[\alpha_1 p + 1 - \alpha_1][\alpha_1^2 p + 1 - \alpha_1^2] \cdots \\ &= [1 - (1 - p)][1 - \alpha_1(1 - p)][1 - \alpha_1^2(1 - p)] \cdots \\ &= \prod_{i=0}^{\infty} [1 - \alpha_1^i (1 - p)]. \end{aligned} \tag{5}$$

In the second step above, note that if R occurs on the first $n$ trials, the
probability of R is $\alpha_1^n p + 1 - \alpha_1^n$ (proved in [1], p. 59).
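The two closed forms in cases (e) and (f) are easy to evaluate numerically. The sketch below is an illustration only: the function names and the truncation length `terms` are assumptions, not part of the original paper, and the infinite products are simply cut off after a fixed number of factors (which loses accuracy as the $\alpha$ in question approaches 1).

```python
def never_product(p, alpha, terms=200):
    """Truncated product prod_{i=0}^{terms-1} (1 - alpha**i * p).

    With alpha = alpha_2 this approximates g(p, alpha_2) of equation (3).
    The truncation length is an assumption made for illustration.
    """
    out = 1.0
    for i in range(terms):
        out *= 1.0 - alpha ** i * p
    return out


def f_alpha1_zero(p, alpha2):
    """Equation (4): f(p; 0, alpha_2) = 1 - g(p, alpha_2)."""
    return 1.0 - never_product(p, alpha2)


def f_alpha2_zero(p, alpha1):
    """Equation (5): f(p; alpha_1, 0) = prod [1 - alpha_1**i * (1 - p)]."""
    return never_product(1.0 - p, alpha1)


if __name__ == "__main__":
    for p in (0.25, 0.50, 0.75):
        print(p, f_alpha1_zero(p, 0.80), f_alpha2_zero(p, 0.75))
```

Both functions satisfy the boundary conditions $f(0) = 0$ and $f(1) = 1$, because the first factor of the relevant product vanishes at the appropriate end point.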
iii. Iterative properties. Any continuous initial approximation to $f(p)$
can be iterated successively to obtain in the limit the function $f(p)$. That is,
suppose $f_0(p)$ is a first guess at the function $f(p)$; then a better approximation
is given by the first iterate

$$f_1(p) = p f_0(\alpha_1 p + 1 - \alpha_1) + (1 - p) f_0(\alpha_2 p).$$

For example if $f_0(p) = p$, then $f_1(p) = p + (\alpha_2 - \alpha_1) p (1 - p)$.
More generally, the $(n + 1)$st iterate is given by

$$f_{n+1}(p) = p f_n(\alpha_1 p + 1 - \alpha_1) + (1 - p) f_n(\alpha_2 p).$$

Certain initial approximations lead to a monotonic sequence of iterates.

(a) If $f_0(p) = p$, the successive iterates monotonically increase toward
$f(p)$ if $\alpha_2 > \alpha_1$, monotonically decrease toward $f(p)$ if $\alpha_2 < \alpha_1$.

(b) If for the beginning approximation

$$f_0(p) = \begin{cases} \prod_{i=0}^{\infty} [1 - \alpha_1^i (1 - p)], & \text{for } \alpha_2 \le \alpha_1, \\ 1 - \prod_{i=0}^{\infty} [1 - \alpha_2^i p], & \text{for } \alpha_2 \ge \alpha_1, \end{cases} \tag{6}$$

the iterates increase (decrease) monotonically to the function. These results
provide two sequences of bounds for $f(p)$ when the approximations mentioned
in (a) and (b) are used.

The iteration converges geometrically, that is, after $n$ iterations
one can be sure that the $n$th iterate $f_n(p)$ deviates from the correct
answer $f(p)$ by no more than $A\rho^n$, where $A > 0$, and $0 < \rho < 1$. Though
geometric convergence sounds speedy, if $\rho$ were near 1, say 0.96, it would
take more than 50 iterations to assure being within $0.1A$. The details needed
for the calculation of $A$ and $\rho$ will not be provided.
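The iteration described under (iii) can be sketched directly by composing functions. This is a minimal illustration, assuming the starting guess $f_0(p) = p$; the helper names are hypothetical, and because each iterate calls the previous one twice, the cost grows like $2^n$, so only a modest number of iterations is practical this way.

```python
def next_iterate(f, a1, a2):
    """Return f_{n+1} built from f_n by one application of equation (2)."""
    def g(p):
        return p * f(a1 * p + 1.0 - a1) + (1.0 - p) * f(a2 * p)
    return g


def iterate_value(p, n, a1, a2):
    """Value of the n-th iterate at p, starting from f_0(p) = p."""
    f = lambda x: x
    for _ in range(n):
        f = next_iterate(f, a1, a2)
    return f(p)


if __name__ == "__main__":
    # Standard example of the paper: alpha_1 = 0.75, alpha_2 = 0.80.
    for n in range(0, 11):
        print(n, iterate_value(0.5, n, 0.75, 0.80))
```

For the standard example the printed values rise steadily from 0.5, as property (a) predicts for $\alpha_2 > \alpha_1$; the first iterate is $0.5 + 0.05 \times 0.25 = 0.5125$.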
These important results provide a starting point for studying the function
$f(p)$, but they do not yield numbers or expressions whose values are
close to the true ones. In the remainder of this paper, several techniques for
approximating $f(p)$ are provided.

A method designed for high-speed calculation will be considered first, 
then an excellent approximation obtained from a differential equation will 
be considered, and then that result will be extended to the two-armed bandit 
problem. Finally, brief mention of some other methods of approximating 
this functional equation will be given. 


Approximation by Simultaneous Equations


Consider a grid of numbers $0 (= p_0), p_1, p_2, \ldots, p_n, 1 (= p_{n+1})$ in
the unit interval, and write the functional equation (2) as it applies to each
of these values of the independent variables. (Lest confusion with earlier
notation develop note that $p_i$ still refers to probabilities, but the subscripts
no longer correspond to trials as they did in earlier sections.) Then one has
the set of equations

$$\begin{aligned} f(0) &= 0 \cdot f(1 - \alpha_1) + 1 \cdot f(0), \\ f(p_1) &= p_1 f(\alpha_1 p_1 + 1 - \alpha_1) + (1 - p_1) f(\alpha_2 p_1), \\ f(p_2) &= p_2 f(\alpha_1 p_2 + 1 - \alpha_1) + (1 - p_2) f(\alpha_2 p_2), \\ &\;\;\vdots \\ f(p_n) &= p_n f(\alpha_1 p_n + 1 - \alpha_1) + (1 - p_n) f(\alpha_2 p_n), \\ f(1) &= 1 \cdot f(1) + 0 \cdot f(\alpha_2). \end{aligned} \tag{7}$$

The first and last members of this set of equations are, of course, tautologies;
there are only $n$ nontrivial equations.

The right-hand sides of the $n$ nontrivial equations of the set (7) each
involves the values of $f(p)$ at points that do not ordinarily coincide with any
of the chosen grid points. However, by using an interpolation formula, both
$f(\alpha_1 p_i + 1 - \alpha_1)$ and $f(\alpha_2 p_i)$, $i = 1, 2, \ldots, n$, may be approximated by
linear combinations of the values of $f(p)$ at two or more consecutive grid
points $p_j, p_{j+1}, \ldots$. The number of grid points required depends upon
whether one uses linear interpolation (two grid points), interpolation with
second differences (three points), third differences (four points), and so forth.

Whatever the number of points may be, each equation of the set (7)
can be replaced by an approximate equality involving as unknowns just the
values of $f(p)$ at several predetermined grid points, and these unknowns
occur only linearly. Thus a system of $n$ linear equations is obtained,
approximately satisfied by the $n$ unknown quantities, $f(p_1), f(p_2), \ldots, f(p_n)$. The
idea of deriving a system of linear equations whose roots approximate $f(p_i)$,
$i = 1, 2, \ldots, n$, was first suggested to us by J. Arthur Greenwood in an
unpublished memorandum, in which linear interpolation was used to
approximate $f(\alpha_1 p_i + 1 - \alpha_1)$ and $f(\alpha_2 p_i)$.
In this and in the following sections a standard numerical example in
which $\alpha_1 = .75$, $\alpha_2 = .80$ is used to illustrate the various methods. This
example has the advantage of being easily displayed; further, numbers are
fairly easy to compute from it. It has the disadvantage of being relatively
easy to fit, so the reader should not be misled into thinking that the precision
attained for it is always obtainable.

Example. The method just described is illustrated for a grid of five
equally spaced points, using the standard example, $\alpha_1 = 0.75$, $\alpha_2 = 0.80$.
Here, the functional equation is

$$f(p) = p f(0.75p + 0.25) + (1 - p) f(0.80p). \tag{8}$$

Taking $p_1 = 0.25$, $p_2 = 0.50$, $p_3 = 0.75$ and writing $f(p_i) = f_i$ for
short, in accordance with equations (7),

$$\begin{aligned} f_1 &= 0.25 f(0.4375) + 0.75 f(0.20), \\ f_2 &= 0.50 f(0.6250) + 0.50 f(0.40), \\ f_3 &= 0.75 f(0.8125) + 0.25 f(0.60). \end{aligned} \tag{9}$$

First, linear interpolation will be used to approximate $f(0.4375)$, $f(0.20)$,
$f(0.6250)$, etc., by means of linear combinations of the five $f$'s: $f_0 (= 0)$,
$f_1$, $f_2$, $f_3$, $f_4 (= 1)$. Thus,

$$f(0.4375) \approx \frac{0.5000 - 0.4375}{0.5000 - 0.2500} f_1 + \frac{0.4375 - 0.2500}{0.5000 - 0.2500} f_2 = 0.25 f_1 + 0.75 f_2,$$

$$f(0.20) \approx \frac{0.25 - 0.20}{0.25 - 0} f_0 + \frac{0.20 - 0}{0.25 - 0} f_1 = 0.80 f_1,$$

and, similarly,

$$\begin{aligned} f(0.6250) &\approx 0.50 f_2 + 0.50 f_3, \\ f(0.40) &\approx 0.40 f_1 + 0.60 f_2, \\ f(0.8125) &\approx 0.75 f_3 + 0.25 f_4 = 0.75 f_3 + 0.25, \\ f(0.60) &\approx 0.60 f_2 + 0.40 f_3. \end{aligned}$$


Substituting these approximate expressions for the several functional values
in the right-hand sides of (9) and collecting all terms involving the unknowns
into the left-hand sides, one obtains

$$\begin{aligned} 0.3375 f_1 - 0.1875 f_2 &\approx 0, \\ -0.2000 f_1 + 0.4500 f_2 - 0.2500 f_3 &\approx 0, \\ -0.1500 f_2 + 0.3375 f_3 &\approx 0.1875. \end{aligned} \tag{10}$$

Replacing the $\approx$ by $=$ in the set of approximations (10) and solving
the resulting equations, one obtains the following approximations to $f_i$.
(The best available values are also shown for comparison.)

  p_i     f_i (approx.)    best values
  .25        0.3385          0.4495
  .50        0.6093          0.7286
  .75        0.8276          0.8987

The agreement with the best available values is only fair.
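The construction of the linear system can be mechanized for any equally spaced grid. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the grid is equally spaced, linear interpolation is used between the two neighboring grid points, and the small system is solved by a plain Gauss-Jordan elimination written out so that no external library is needed.

```python
def solve_linear(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def grid_solution(a1, a2, n_interior):
    """Approximate f at n_interior equally spaced interior grid points."""
    h = 1.0 / (n_interior + 1)
    size = n_interior + 2                 # unknowns f_0 ... f_{n+1}
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    a[0][0] = 1.0                         # boundary condition f(0) = 0
    a[-1][-1] = 1.0                       # boundary condition f(1) = 1
    b[-1] = 1.0
    for i in range(1, size - 1):
        p = i * h
        a[i][i] += 1.0
        # Two off-grid arguments of equation (2), with their weights p and 1 - p.
        for weight, x in ((p, a1 * p + 1.0 - a1), (1.0 - p, a2 * p)):
            j = min(int(x / h), size - 2)  # grid point just below x
            t = x / h - j                  # fractional distance to the next one
            a[i][j] -= weight * (1.0 - t)
            a[i][j + 1] -= weight * t
    return solve_linear(a, b)[1:-1]


if __name__ == "__main__":
    print(grid_solution(0.75, 0.80, 3))
```

With three interior points and the standard example this reproduces the system (10); its roots come out within about 0.001 of the approximations tabulated above.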

Now use second-order interpolation for approximating the non-grid-point
values of $f(p)$ that occur in the right-hand sides of (9). The general
formula (with equally spaced grid points) is

$$f(x) \approx \frac{(x - x_{i+1})(x - x_{i+2})}{2h^2} f(x_i) - \frac{(x - x_i)(x - x_{i+2})}{h^2} f(x_{i+1}) + \frac{(x - x_i)(x - x_{i+1})}{2h^2} f(x_{i+2}), \tag{11}$$

where $h = x_{i+1} - x_i$. Note that (11) gives the interpolated value as a weighted
average of the three adjacent tabled values instead of using differences.

Applying (11) to the problem at hand and substituting these approximate
expressions into the right-hand sides of (9), one obtains the following
system of approximations.

$$\begin{aligned} 0.2410 f_1 - 0.1744 f_2 + 0.0235 f_3 &\approx 0, \\ -0.1400 f_1 + 0.3925 f_2 - 0.3150 f_3 &\approx -0.0625, \\ -0.0497 f_2 + 0.1369 f_3 &\approx 0.0872, \end{aligned} \tag{12}$$

whose roots yield the following approximations.

  p_i     f_i (approx.)    best values
  .25        0.4279          0.4495
  .50        0.7122          0.7286
  .75        0.8955          0.8986

These results are a definite improvement over those obtained by linear
interpolation.
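The weights of a three-point (second-order) interpolation are simple to compute, and they can be used to check the coefficients of the first equation of the second-order system. The sketch below assumes the three grid points are $x_0$, $x_0 + h$, $x_0 + 2h$ and that $f(0.4375)$ is interpolated from the points 0.25, 0.50, 0.75 and $f(0.20)$ from 0, 0.25, 0.50; the function name is hypothetical.

```python
def quadratic_weights(x, x0, h):
    """Three-point Lagrange weights on x0, x0 + h, x0 + 2h for the value at x."""
    t = (x - x0) / h
    return ((1.0 - t) * (2.0 - t) / 2.0, t * (2.0 - t), -t * (1.0 - t) / 2.0)


def first_row_coefficients():
    """Coefficients of f_1, f_2, f_3 in f_1 = 0.25 f(0.4375) + 0.75 f(0.20)."""
    wa = quadratic_weights(0.4375, 0.25, 0.25)   # weights on f_1, f_2, f_3
    wb = quadratic_weights(0.20, 0.00, 0.25)     # weights on f_0 (= 0), f_1, f_2
    c1 = 1.0 - 0.25 * wa[0] - 0.75 * wb[1]
    c2 = -(0.25 * wa[1] + 0.75 * wb[2])
    c3 = -0.25 * wa[2]
    return c1, c2, c3


if __name__ == "__main__":
    print(first_row_coefficients())
```

The three coefficients are 0.2409375, -0.174375, and 0.0234375, which agree with the first equation of the second-order system to the four decimals printed there.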

The above example seems to indicate that a considerable improvement
of the approximation can be expected when higher differences are used in the
interpolation formula for expressing the non-grid-point values of $f(p)$ in
terms of the grid-point values. However, the interpolation formulas become
more and more cumbersome to work with numerically as higher differences
are included. It therefore is pertinent to see how much improvement can
be gained by increasing the number of grid points alone.
TABLE 1

                        Number of grid points
  p_i       4           5           6           11          21       Best value
  .10     .1347       .1473    [illegible]    .1864       .1984       .2055
  .25  [illegible]    .3388    [illegible]  [illegible]   .4375       .4495
  .50     .5690       .5955       .6129       .6872       .7133       .7286
  .75     .7845       .8007    [illegible]    .8666       .8856       .8987
  .90     .9138       .9203    [illegible]    .9476       .9586       .9658


Using only linear interpolation, approximations from grids of 4, 5, 6, 
11, and 21 points were obtained. These points were not equally spaced because 
it was hoped that better results would be obtained by spacing the grid so 
that the functional values would be approximately equally spaced. Infor- 
mation needed for such spacing was available from other methods described 
later. 

Linear interpolations were made in the results for the five grids de- 
scribed above to obtain approximate values at p = 0.10, 0.25, 0.50, 0.75, 
0.90. The numbers are shown in Table 1, together with the best known 
values. 

Using the difference between the best value and the cell entry for a given
$p_i$ as a measure of error, it will be noted that, very roughly, the error decreases
linearly with the spacing. On the other hand, with a five-point grid, changing
from linear to second-order interpolation gives improvement roughly equivalent
to that given by increasing the number of points to 21 and using linear
interpolation only. Since simultaneous equations are expensive to solve, it
appears that second-order interpolation is well worth the effort, contrary to
usual advice.

Calculations, with the aid of an electronic computer, using 21 grid
points and second-difference interpolation as well as third-difference
interpolation have been made. The results are summarized in Table 2. The results
obtained by using second-order differences are hardly distinguishable from
those using third-order differences, though in a more sharply curved example
they could be more useful. The third-order interpolation column provided
numbers labeled “best values” throughout this paper.

In principle, any desired degree of accuracy can be attained by using
finer grids, but the cost of the calculations increases roughly as the square
of the number of grid points used. A high-speed computer could be
programmed to write its own equations and solve them, but such a program
was not written. If good accuracy is required, the techniques proposed in
this section are recommended.


TABLE 2

Approximations Using Second- and Third-Order Interpolations
with 21 Grid Points, and the Approximation
by Second-Order Differential Equation

   p      f (second-order     f (third-order      f (differential
           interpolation)      interpolation)       equation)
  .00         .00000              .00000              .00000
  .05         .10718              .10778            [illegible]
  .10         .20495              .20547              .19833
  .15         .29447              .29455              .28601
  .20         .37528              .37504              .36648
  .25         .44913              .44947              .44035
  .30         .51637              .51629              .50774
  .35         .57739              .57736              .56982
  .40         .63279              .63277              .62626
  .45         .68304              .68305              .67764
  .50         .72858              .72859              .72435
  .55         .76981              .76983              .76672
  .60       [illegible]           .80713              .80504
  .65         .84082              .84084              .83967
  .70         .87126              .87127              .87088
  .75         .89873              .89874              .89891
  .80         .92349              .92350              .92402
  .85         .94578              .94579              .94648
  .90         .96584              .96584              .96648
  .95         .98387              .98387              .98425
 1.00        1.00000             1.00000             1.00000


Approximation by a Differential Equation


An essential feature of the simultaneous-equations approximation
discussed in the preceding section was the replacement of non-grid-point
values of $f(p)$ by linear combinations of grid-point values. The continuous
variable analogue of this procedure is the expansion of $f(\alpha_1 p + 1 - \alpha_1)$ and
$f(\alpha_2 p)$ as Taylor's series in the neighborhood of $p$. This approach will now be
used to derive a differential equation whose solution yields an approximation
to the desired function, $f(p)$.

Rewriting $f(\alpha_1 p + 1 - \alpha_1)$ as $f(p + (1 - \alpha_1)(1 - p))$, and expanding
the latter as a Taylor's series,

$$f(p + (1 - \alpha_1)(1 - p)) = f(p) + (1 - \alpha_1)(1 - p) f'(p) + \frac{(1 - \alpha_1)^2 (1 - p)^2}{2!} f''(p) + \cdots, \tag{13}$$

where $f'$ and $f''$ are the first and second derivatives of $f$ with respect to $p$.
Similarly, expand $f(\alpha_2 p)$ as follows:

$$f(\alpha_2 p) = f(p - (1 - \alpha_2)p) = f(p) - (1 - \alpha_2) p f'(p) + \frac{(1 - \alpha_2)^2 p^2}{2!} f''(p) - \cdots. \tag{14}$$

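Using only through the term in $f''(p)$ in the two series (13) and (14),
substitute these expressions for the functions in the right-hand side of the
functional equation (2). The result is a differential equation

$$\begin{aligned} f(p) = {} & p\left[f(p) + (1 - \alpha_1)(1 - p) f'(p) + \tfrac{1}{2}(1 - \alpha_1)^2 (1 - p)^2 f''(p)\right] \\ & + (1 - p)\left[f(p) - (1 - \alpha_2) p f'(p) + \tfrac{1}{2}(1 - \alpha_2)^2 p^2 f''(p)\right]. \end{aligned} \tag{15}$$

By rearranging terms in (15),

$$\tfrac{1}{2}\left[(1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\} p\right] f''(p) + (\alpha_2 - \alpha_1) f'(p) = 0.$$

Hence,

$$\frac{f''(p)}{f'(p)} = \frac{-2(\alpha_2 - \alpha_1)}{(1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\} p}, \tag{16}$$

which is integrated to yield

$$f'(p) = C_1 \left[(1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\} p\right]^{1/(1 - \bar{\alpha})}, \tag{17}$$

where $C_1$ is a constant of integration, and $\bar{\alpha}$ is an abbreviation for $(\alpha_1 + \alpha_2)/2$.
Integrating both sides of (17),

$$f(p) = C_1' \left[(1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\} p\right]^{1 + 1/(1 - \bar{\alpha})} + C_2, \tag{18}$$

where $C_1'$ and $C_2$ are new constants of integration.
Determining $C_1'$ and $C_2$ from the boundary conditions $f(0) = 0$ and
$f(1) = 1$, the final form of $f(p)$ is

$$f(p) = \frac{A^B - (A - p)^B}{A^B - (A - 1)^B}, \tag{19}$$

where

$$A = \frac{(1 - \alpha_1)^2}{(1 - \alpha_1)^2 - (1 - \alpha_2)^2}$$

and

$$B = \frac{1}{1 - \bar{\alpha}} + 1.$$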
Example: Taking $\alpha_1 = 0.75$, $\alpha_2 = 0.80$, as before, calculate the constants
occurring in (19).

$$A = \frac{(0.25)^2}{(0.25)^2 - (0.20)^2} = 2.7778,$$

$$B = \frac{1}{1 - 1.55/2} + 1 = 5.4444.$$

Hence, from (19),

$$f(p) = \frac{260.42 - (2.7778 - p)^{5.4444}}{237.49}. \tag{20}$$

Using (20), calculate the values of $f(p)$ for $p = 0.25$, $0.50$, and $0.75$.

  p_i     f_i (approx.)    best values
  .25        0.4403          0.4495
  .50        0.7244          0.7286
  .75        0.8989          0.8987


Values of f(p) in intervals of 0.05 for p are shown in Table 2, where they may 
be compared with the best values so far obtained. Among the various approxi- 
mate methods which can be easily carried out with desk calculators, the 
differential equation method yields results in closest agreement with those 
obtained by the simultaneous equations using 21 grid points and third- 
difference interpolation. 
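The closed form obtained from the differential equation is straightforward to evaluate. The sketch below is an illustration only; the function name is hypothetical. It works with the bracketed quantity $u(p) = (1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\}p$ rather than with $A - p$ directly, which is algebraically the same ratio (a common factor cancels) but keeps the base of the power positive even when $\alpha_1 > \alpha_2$.

```python
def diff_eq_approx(p, a1, a2):
    """Differential-equation approximation to f(p), equivalent to (19)."""
    if a1 == a2:
        return p                           # the solution reduces to f(p) = p
    abar = (a1 + a2) / 2.0
    big_b = 1.0 / (1.0 - abar) + 1.0       # the exponent B
    d = (1.0 - a1) ** 2 - (1.0 - a2) ** 2

    def u(x):
        # Positive on [0, 1]; proportional to A - x.
        return (1.0 - a1) ** 2 - d * x

    return (u(0.0) ** big_b - u(p) ** big_b) / (u(0.0) ** big_b - u(1.0) ** big_b)


if __name__ == "__main__":
    for p in (0.25, 0.50, 0.75):
        print(p, round(diff_eq_approx(p, 0.75, 0.80), 4))
```

For the standard example this gives values that agree with the differential-equation column of Table 2 to about four decimals.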


The Two-Armed Bandit 


The differential equation approach can equally easily be applied to the 
more general model appropriate to the two-armed bandit problem with 
partial reinforcement on each arm (or the equivalent T-maze experiment). 

Suppose that there are two responses R and L, and whichever occurs
a reward or a nonreward follows. If R occurs, reward follows with probability
$\pi_1$; if L occurs, reward follows with probability $\pi_2$. If $p$ is the
probability of R on a given trial, the new probability for R is as follows.

  New probability                                        Probability
  for R                                                  of happening
  $\alpha_1 p + 1 - \alpha_1$    if R and reward occur         $\pi_1 p$
  $\alpha_2 p + 1 - \alpha_2$    if R and nonreward            $(1 - \pi_1) p$
  $\alpha_1 p$                   if L and reward               $\pi_2 (1 - p)$
  $\alpha_2 p$                   if L and nonreward            $(1 - \pi_2)(1 - p)$

These results represent a special case of those presented in ([1], pp. 118, 286)
and discussed briefly on p. 287 in the paragraph following equation (13.22)
in [1].
It has been assumed that reward is equally effective on either side and
that nonreward is also equally effective on either side. It should be recalled
that these transition rules imply that nonreward improves the probability
of choosing a given side, as discussed in the opening section of this paper.

Now in the same way that the basic functional equation (2) for the
simple approach-approach problem was derived, the basic functional equation
for the two-armed bandit problem with partial reinforcement can be
derived. The functional equation for the proportion $f(p)$ of organisms who
eventually learn to make only response R is

$$\begin{aligned} f(p) = {} & p\pi_1 f(\alpha_1 p + 1 - \alpha_1) + p(1 - \pi_1) f(\alpha_2 p + 1 - \alpha_2) \\ & + (1 - p)\pi_2 f(\alpha_1 p) + (1 - p)(1 - \pi_2) f(\alpha_2 p). \end{aligned} \tag{21}$$

No generality is lost, and there is some gain in the sequel, if it is assumed
that $\alpha_1 < \alpha_2$ and $\pi_1 \ge \pi_2$. If $\pi_1 = 1$ and $\pi_2 = 0$, (21) reduces to (2).
Using the approximations (13) and (14) for $f(\alpha_i p + 1 - \alpha_i)$ and $f(\alpha_i p)$,
respectively, (21) can be rewritten, after rearrangement of terms, as

$$\left[(\pi_1 - \pi_2)\{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\} p - \{\pi_1 (1 - \alpha_1)^2 + (1 - \pi_1)(1 - \alpha_2)^2\}\right] f''(p) = 2(\pi_1 - \pi_2)(\alpha_2 - \alpha_1) f'(p). \tag{22}$$

The boundary conditions are $f(0) = 0$, $f(1) = 1$, as before.


Comparing (22) with the corresponding differential equation, (16),
for the simpler model, the general solution of (22) has the form

$$f(p) = \frac{A^B - (A - p)^B}{A^B - (A - 1)^B}, \tag{23}$$

where the constant $A$ is now defined as

$$A = \frac{\pi_1 (1 - \alpha_1)^2 + (1 - \pi_1)(1 - \alpha_2)^2}{(\pi_1 - \pi_2)\{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\}} \qquad (\alpha_1 \neq \alpha_2,\ \pi_1 \neq \pi_2),$$

while

$$B = \frac{1}{1 - \bar{\alpha}} + 1$$

as before. Note that the expression for $A$ for the simple approach-approach
problem is obtained by substituting $\pi_1 = 1$, $\pi_2 = 0$ in the present $A$.

The expression for $A$ is undefined when either $\alpha_1 = \alpha_2$ or $\pi_1 = \pi_2$,
hence (23) cannot be used. In each of these cases, however, it can be argued
from first principles that the function sought is $f(p) = p$. This result is also
given by the differential equation (22), which reduces to $f''(p) = 0$ under
these special conditions.
Monte Carlo Calculations for Two-Armed Bandits


Twery and Bush made a series of Monte Carlo calculations on Illiac
of $f(0.50)$ for two-armed bandit experiments with $\pi_1 = 0.75$, $\pi_2 = 0.25$ for
various combinations of $\alpha$-values. The case of $\alpha_1 = 0.90$, $\alpha_2 = 0.95$ will be
used to calculate the value of $f(0.50)$ from (23).

For the stated parameter values,

$$A = \frac{(0.75)(0.10)^2 + (0.25)(0.05)^2}{(0.50)\{(0.10)^2 - (0.05)^2\}} = 2.1667,$$

$$B = \frac{1}{1 - 0.925} + 1 = 14.3333.$$

Hence, (23) in this case becomes

$$f(p) = \frac{65018.7 - (2.1667 - p)^{14.3333}}{65006.6}. \tag{24}$$

From this formula,

$$f(0.50) = 0.977,$$

compared with Twery and Bush's result, 0.970.
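The same calculation can be sketched for arbitrary parameter values. The code below is an illustration, not the authors' program: the function name is hypothetical, and as a numerical convenience it evaluates the ratio in terms of the positive quantity $u(p) = \pi_1(1-\alpha_1)^2 + (1-\pi_1)(1-\alpha_2)^2 - (\pi_1 - \pi_2)\{(1-\alpha_1)^2 - (1-\alpha_2)^2\}p$, which is proportional to $A - p$, so the result equals (23).

```python
def bandit_diff_eq_approx(p, a1, a2, pi1, pi2):
    """Differential-equation approximation for the two-armed bandit, as in (23)."""
    d = (pi1 - pi2) * ((1.0 - a1) ** 2 - (1.0 - a2) ** 2)
    if d == 0.0:
        return p                      # alpha_1 = alpha_2 or pi_1 = pi_2: f(p) = p
    x0 = pi1 * (1.0 - a1) ** 2 + (1.0 - pi1) * (1.0 - a2) ** 2
    big_b = 1.0 / (1.0 - (a1 + a2) / 2.0) + 1.0

    def u(x):
        # Proportional to A - x, and positive for 0 <= x <= 1.
        return x0 - d * x

    return (u(0.0) ** big_b - u(p) ** big_b) / (u(0.0) ** big_b - u(1.0) ** big_b)


if __name__ == "__main__":
    print(round(bandit_diff_eq_approx(0.5, 0.90, 0.95, 0.75, 0.25), 3))
```

With $\pi_1 = 1$ and $\pi_2 = 0$ the function reproduces the approximation for the simple approach-approach problem.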
The values of $f(0.50)$, calculated from (23) for the various combinations
of alpha values used by Twery and Bush, are shown in Table 3 along with
the Monte Carlo result obtained by these authors. Their numbers were
obtained in a pseudo-experiment in which 100 sequences of 800 trials each
were run with random numbers. The entry itself is the average value of $p$
for the 100 sequences at trial 800. Thus it has some random variation and
is pre-asymptotic to the extent that 800 trials is not an infinite number.
The agreement is quite encouraging for the use of the differential-equation
method. The agreement between the Monte Carlo results and the differential
equation is surprisingly close, considering that only 100 sequences were used
and that the differential equation is only an approximation. On the other
hand, both learning parameters are near unity in these examples; in that
neighborhood the differential equation should be quite a good approximation.


TABLE 3

Comparison of Differential Equation Results (first entry)
with Those of Twery and Bush (second entry)
Obtained from the Mean Probability Level of 100 Sequences
at the 800th Trial for Various $\alpha_1$, $\alpha_2$, for $p = 0.5$
and $\pi_1 = 1 - \pi_2 = 0.75$

                                     $\alpha_2$
 $\alpha_1$      .91          .92          .93          .94          .95          .96          .97
  .90       .634         .770         .878         .944         .977      [illegible]  [illegible]
            .610         .780         .880         .958         .970         ---       [illegible]
  .91                 [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
                         .700         .840         .900      [illegible]  [illegible]  [illegible]
  .92                                 .707         .877         .962         .990         ---
                                      .669         .880         .980         .997         ---
  .93                                              .763         .933         .987         .988
                                                   .834         .960         .990        1.000
  .94                                                           .835         .975         ---
                                                             [illegible]  [illegible]  [illegible]
  .95                                                                        .916         .996
                                                                             .826         .999
  .96                                                                                     .979
                                                                                       [illegible]
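A pseudo-experiment of the kind described can be sketched in a few lines. This is an illustrative simulation under the transition rules of the two-armed bandit model, not a reconstruction of the Twery and Bush program: the function name, the seed, and the use of Python's `random` module are assumptions, and individual runs will differ from the published entries through sampling variation.

```python
import random


def mean_p_after_trials(a1, a2, pi1, pi2, p0=0.5,
                        trials=800, sequences=100, seed=0):
    """Average probability of R at the final trial over simulated sequences."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sequences):
        p = p0
        for _ in range(trials):
            chose_r = rng.random() < p
            rewarded = rng.random() < (pi1 if chose_r else pi2)
            alpha = a1 if rewarded else a2    # reward uses alpha_1, nonreward alpha_2
            if chose_r:
                p = alpha * p + (1.0 - alpha)  # R chosen: probability of R moves up
            else:
                p = alpha * p                  # L chosen: probability of R moves down
        total += p
    return total / sequences


if __name__ == "__main__":
    print(mean_p_after_trials(0.90, 0.95, 0.75, 0.25))
```

Because nearly every sequence has been absorbed close to 0 or 1 by trial 800, the average behaves like a proportion based on 100 cases, which is the source of the random variation mentioned in the text.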


T-maze Experiment with Paradise Fish 


In the first section of this paper, a T-maze experiment by Bush and
Wilson [2] using paradise fish was described. The rate of reward was 0.75
for response R and 0.25 for response L. In the notation of our model, $\pi_1 = 0.75$
and $\pi_2 = 0.25$. The learning-rate parameters were estimated to be $\alpha_1 = 0.916$
and $\alpha_2 = 0.942$ for the group in which the fish could see the reward through
a transparent divider when they chose the unrewarded side. The initial
probability for response R (estimated from results on the first 10 of the 140
trials) varied considerably from one fish to another, the average value being
0.496, or nearly 0.50. Bush and Wilson report that the initial distribution
of $p$ approximately followed the symmetrical Beta distribution

$$y = 3.610\,[p(1 - p)]^{0.7}. \tag{25}$$


This initial distribution was used to calculate the expected fraction
attracted by R. The relative areas under the curve (25) in the ten intervals

$$(0, 0.1), (0.1, 0.2), (0.2, 0.3), \ldots, (0.9, 1.0)$$

were found, the values of $f(p)$ at the midpoints of these intervals were
calculated, and their weighted average was obtained. The result was $f(p) = 0.800$.

In the experiment, Bush and Wilson found 15 of the 22 fish in the ex- 
perimental group making nearly all R responses after about 100 trials. This 
leads to the estimate 0.68 for the proportion ultimately attracted to the R 
response. That result is only about one standard error away from the fitted 
value 0.80. That small deviation does not even take any account of the 
unreliability of the original estimates of the a’s. 


Other Methods 


Several other methods of approximating the function have been explored.
One that was rather successful employed the function $f(p; \alpha, 0)$ or
$f(1 - p; 0, \alpha)$, choosing a value of $\alpha$ that made the iterate change very
little. This method was superior to an iteration technique beginning with
$f_0(p) = p$.

Since one knows exactly the solution to the functional equation in the
special case $\alpha_1 = \alpha_2$, the notion of expanding $f(p; \alpha_1, \alpha_2)$ as a power series
in $\alpha_2$ in the neighborhood of $\alpha_1$ suggests itself. Robert R. Bush, in an
unpublished note, developed such a technique.
A THEORY OF DISCRIMINATION LEARNING¹


FRANK RESTLE
Stanford University²


This paper presents a theory of two-choice discrimination learning.
Though similar in form to earlier theories of simple learning by Estes
(5) and Bush and Mosteller (2, 3), this system introduces a powerful
new assumption which makes definite quantitative predictions easier to
obtain and test. Several such predictions dealing with learning and
transfer are derived from the theory and tested against empirical data.

The stimulus situation facing a subject in a trial of discrimination
learning is thought of as a set of cues. A subset of these cues may
correspond to anything (concrete or abstract, present, past, or future,
of any description) to which the subject can learn to make a
differential response. In this definition it does not matter whether the
subject actually makes a differential response to the set of cues as long
as he has the capacity to learn one. An individual cue is thought of as
“indivisible” in the sense that different responses cannot be learned to
different parts of it. Informally, the term “cue” will occasionally be
used to refer to any set of cues, all of which are manipulated in the
same way during a whole experiment.


¹ This paper is adapted from part of a Ph.D. dissertation submitted to
Stanford University. The author is especially indebted to Dr. Douglas H.
Lawrence and to Dr. Patrick Suppes for encouragement and criticism.
Thanks are also due Dr. W. K. Estes who loaned prepublication manuscripts
and Dr. R. R. Bush who pointed out some relations between the present
theory and the Bush-Mosteller model (3).

² Now at the Human Resources Research Office, The George Washington
University.


In problems to be analyzed by this theory, every individual cue is either
“relevant” or “irrelevant.” A cue is relevant if it can be used by the
subject to predict where or how reward is to be obtained. For example, if
food is always found behind a black card in a rat experiment, then cues
aroused by the black card are relevant. A cue aroused by an object
uncorrelated with reward is “irrelevant.” For example, if the reward
is always behind the black card but the black card is randomly moved
from left to right, then “position” cues are irrelevant. These concepts
are discussed by Lawrence (6).

In experiments to be considered, 
the subject has just two choice re- 
sponses. No other activities are con- 
sidered in testing the theory. Any 
consistent method of describing these 
two responses which can be applied 
throughout a complete experiment is 
acceptable in using this theory. 


THEORY 


In solving a two-choice discrimination problem the subject learns to
relate his responses correctly to the relevant cues. At the same time his
responses become independent of the irrelevant cues. These two aspects
of discrimination learning are represented by two hypothesized processes,
“conditioning” and “adaptation.”

Intuitively, a conditioned cue is one which the subject knows how to use
in getting reward. If $k$ is a relevant cue and $c(k,n)$ is the probability that
$k$ has been conditioned at the beginning of the $n$th trial, then

$$c(k, n+1) = c(k, n) + \theta\,[1 - c(k, n)] \tag{1}$$


This article appeared in Psychol. Rev., 1955, 62, 11-19. Reprinted with permission. 


is the probability that it will be conditioned by the beginning of the next
trial. On each trial of a given problem a constant proportion, $\theta$, of
unconditioned relevant cues becomes conditioned.

To the extent that a conditioned cue affects performance, it contributes
to a correct response only, whereas an unconditioned relevant cue
contributes equally to a correct and to an incorrect response.

Intuitively, an adapted cue is one which the subject does not consider
in deciding upon his choice response. If a cue is thought of as a “possible
solution” to the problem, an adapted cue is a possible solution which the
subject rejects or ignores. If $a(k,n)$ is the probability that irrelevant cue
$k$ has been adapted at the beginning of the $n$th trial, then

$$a(k, n+1) = a(k, n) + \theta\,[1 - a(k, n)] \tag{2}$$

is the probability that it will be adapted by the beginning of the
next trial. On each trial of a given problem a constant proportion of
unadapted irrelevant cues becomes adapted. An adapted cue is
nonfunctional in the sense that it contributes neither to a correct nor to
an incorrect response.

It will be noticed that the same constant $\theta$ appears in both equations
1 and 2. The fundamental simplifying assumption of this theory deals
with $\theta$. This assumption is that

$$\theta = \frac{r}{r + i} \tag{3}$$

where $r$ is the number of relevant cues in the problem and $i$ is the
number of irrelevant cues. Thus, $\theta$ is the proportion of relevant cues in
the problem. This proportion is the same as the fraction of unconditioned
cues conditioned on each trial, and the fraction of unadapted cues adapted
on each trial.

The performance function $p(n)$, representing the probability of a correct
response on the $n$th trial, is in accord with the definitions of
conditioning and adapting given above. The function is in the form of a
ratio, with the total number of unadapted cues in the denominator and the
number of conditioned cues plus one-half times the number of other cues in
the numerator. Thus conditioned cues contribute their whole effect toward a
correct response, adapted cues contribute nothing toward either response,
and other cues contribute their effect equally toward correct and
incorrect responses. Formally,

$$p(n) = \frac{\sum^{r} c(k, n) + \tfrac{1}{2} \sum^{r} [1 - c(k, n)] + \tfrac{1}{2} \sum^{i} [1 - a(k, n)]}{r + \sum^{i} [1 - a(k, n)]} \tag{4}$$

Here $\sum^{r}$ is the sum taken over the $r$ relevant cues and $\sum^{i}$ is the sum
taken over the $i$ irrelevant cues.


SOME CONSEQUENCES REGARDING
SIMPLE LEARNING


If the subject is naive at the beginning of training, so that for any
relevant cue $k$, $c(k,1) = 0$, and for any irrelevant cue $k$, $a(k,1) = 0$, and
if he receives $n$ trials on a given problem, then by mathematical induction
it can be shown that if $k$ is relevant,

$$c(k, n+1) = 1 - (1 - \theta)^n \tag{5}$$

and if $k$ is irrelevant,

$$a(k, n+1) = 1 - (1 - \theta)^n. \tag{6}$$

Under these circumstances we can substitute equations 5 and 6 into
equation 4 and, taking advantage of the simplifying effects of equation 3,
we have

$$p(n) = 1 - \frac{\tfrac{1}{2}(1 - \theta)^{n-1}}{\theta + (1 - \theta)^n}. \tag{7}$$

Plotting equation 7 shows that $p$ is an S-shaped function of $n$ with an
asymptote (for $\theta > 0$) at 1.00. Also, $p(1) = \tfrac{1}{2}$. Since $p(n)$ is a monotonic
increasing function of $\theta$ we can estimate $\theta$ from observations of
performance. If we want to know the theoretical proportion of relevant cues
in a problem for a particular subject, we have the subject work on the
problem, record his performance curve, and solve equation 7 for $\theta$. This
result depends directly upon the simplifying assumption of equation 3.
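The learning curve can be checked numerically by substituting equations 5 and 6 into equation 4 for given numbers of cues. The sketch below is an illustration only: the function names are hypothetical, every cue of a given kind is assumed to share the same probability (as it does for a naive subject), and the closed form used for comparison is the one that follows algebraically from equations 3 through 6.

```python
def performance_from_cues(n, r, i):
    """Equation 4 evaluated with equations 5 and 6 for a naive subject."""
    theta = r / (r + i)                     # equation 3
    c = 1.0 - (1.0 - theta) ** (n - 1)      # conditioned by the start of trial n
    a = 1.0 - (1.0 - theta) ** (n - 1)      # adapted by the start of trial n
    numerator = r * c + 0.5 * r * (1.0 - c) + 0.5 * i * (1.0 - a)
    denominator = r + i * (1.0 - a)
    return numerator / denominator


def performance_closed_form(n, theta):
    """p(n) = 1 - (1/2)(1 - theta)^(n-1) / [theta + (1 - theta)^n]."""
    q = (1.0 - theta) ** (n - 1)
    return 1.0 - 0.5 * q / (theta + (1.0 - theta) * q)


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 40):
        print(n, performance_from_cues(n, 2, 8), performance_closed_form(n, 0.2))
```

The two functions agree trial by trial, start at one-half on the first trial, and rise toward 1.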

Since the instability of individual learning curves makes it difficult to fit curves to them, it is fortunate that $\theta$ can be determined in a different way. Suppose a subject makes $E$ errors in the course of solving the problem to a very rigorous criterion and it is assumed for practical purposes that he has made all the errors he is going to make. Theoretically, the total number of errors made on a problem can be written

$$E = \sum_{n=1}^{\infty} [1 - p(n)].$$

Under the conditions satisfying equation 7, this can be evaluated approximately by using the continuous time variable $t$ in place of the discrete trial variable $n$, and integrating. The result of this integration is that

$$E \cong \tfrac{1}{2} + \tfrac{1}{2}\,\frac{\log \theta}{(1-\theta)\log(1-\theta)}. \qquad [8]$$

By equation 8, which relates the total number of errors made on a problem to $\theta$, it is possible to make relatively stable estimates of $\theta$.
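Equation 8 gives $E$ as a function of $\theta$, whereas in practice $\theta$ is wanted from an observed $E$. The sketch below inverts the relation numerically by bisection, reading equation 8 as $E \cong \tfrac{1}{2} + \tfrac{1}{2}\log\theta / [(1-\theta)\log(1-\theta)]$. The function names and search bounds are assumptions made for illustration; the paper does not say how the inversion was carried out.

```python
import math


def expected_errors(theta: float) -> float:
    """Total expected errors under equation 8 (0 < theta < 1)."""
    return 0.5 + 0.5 * math.log(theta) / ((1.0 - theta) * math.log(1.0 - theta))


def theta_from_errors(errors: float, lo: float = 1e-6, hi: float = 0.999,
                      tol: float = 1e-10) -> float:
    """Solve expected_errors(theta) = errors by bisection.

    expected_errors falls as theta rises, so the bracket is halved
    toward the side on which the target error count lies.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_errors(mid) > errors:
            lo = mid   # too many errors predicted: theta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(round(theta_from_errors(98.5), 4))
```

Because the relation is steep for small $\theta$, modest sampling error in $E$ translates into a small absolute error in the estimate of $\theta$.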
AN EMPIRICAL TEST OF THE SIMPLE
LEARNING THEORY:
COMBINATION OF CUES


Consider three problems, $s_1$, $s_2$, and $s_3$, all of which involve the same irrelevant cues. Two of the problems, $s_1$ and $s_2$, have entirely separate and different relevant cues, while in problem $s_3$ all the relevant cues of $s_1$ and $s_2$ are present and relevant. That is, $r_3 = r_1 + r_2$ and $i_1 = i_2 = i_3 = i$. If we know $\theta_1$ and $\theta_2$ we can compute $\theta_3$, since by equation 3

$$\theta_1 = \frac{r_1}{r_1 + i}, \qquad \theta_2 = \frac{r_2}{r_2 + i}, \qquad \theta_3 = \frac{r_1 + r_2}{r_1 + r_2 + i}.$$

Solving these equations for $\theta_3$ in terms of $\theta_1$ and $\theta_2$ we get

$$\theta_3 = \frac{\theta_1 + \theta_2 - 2\theta_1\theta_2}{1 - \theta_1\theta_2}. \qquad [9]$$


This theorem answers the following 
question: Suppose we know how many 
errors are made in learning to use 
differential cue X and how many are 
used to learn cue Y, then how many 
errors will be made in learning a prob- 
lem in which either X or Y can be 
used (if X and Y are entirely dis- 
crete)? 
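The combination law can be checked against the cue-counting definition it was derived from. In the sketch below the cue counts (2 and 3 relevant cues, 5 shared irrelevant cues) are hypothetical and chosen only so that the arithmetic is easy to follow; the function names are likewise illustrative.

```python
def theta_from_counts(r: int, i: int) -> float:
    """Equation 3: proportion of relevant cues."""
    return r / (r + i)


def combine(theta1: float, theta2: float) -> float:
    """Equation 9: theta for a problem containing both sets of relevant cues."""
    return (theta1 + theta2 - 2.0 * theta1 * theta2) / (1.0 - theta1 * theta2)


if __name__ == "__main__":
    # Hypothetical counts: r1 = 2, r2 = 3, shared irrelevant cues i = 5.
    t1 = theta_from_counts(2, 5)
    t2 = theta_from_counts(3, 5)
    print(combine(t1, t2), theta_from_counts(2 + 3, 5))
    # The single-cue estimates quoted for the Eninger data.
    print(round(combine(0.020, 0.029), 4))
```

With the rounded estimates .020 and .029 the formula returns a value of roughly .048, close to the figure quoted in the text; the combined value always exceeds either single-cue value when both are between 0 and 1.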

Eninger (4) has run an experiment 
which tests equation 9. Three groups 
of white rats were run in a T maze 
on successive discrimination problems. 
The first group learned a visual dis- 
crimination, black-white, the second 
group learned an auditory discrimina- 
tion, tone-no-tone, and the third 
group had both cues available and 
relevant. 

Since each group was run to a rigorous criterion, total error scores are used to estimate $\theta_1$ and $\theta_2$ by equation 8.³ The values estimated are $\theta_1 = .020$, based on an estimated average of 98.5 errors made on the auditory-cue problem, and $\theta_2 = .029$, based on an estimated average of 64.5 errors on the visual-cue problem. Putting these two values into equation 9 we get

$$\theta_3 = \frac{.029 + .020 - 2(.020)(.029)}{1 - (.020)(.029)} = .049.$$

This value of $\theta_3$ substituted into equation 8 leads to the expectation of about 33 total errors on the combined cues problem. In fact, an average of 26 errors was made by the four subjects on this problem. The prediction is not very accurate. However, only 14 animals were employed in the entire experiment, in groups of five, five, and four. Individual differences among animals within groups were considerable. If account is taken of sampling variability of the two single-cue groups and of the combined-cue group of subjects, the prediction is not significantly wrong. Further experimentation is needed to determine whether the proposed law is tenable.

It is easily seen that $\theta_3$ will always be larger than $\theta_1$ or $\theta_2$ if all three problems are solved. Learning will always be faster in the combined-cues problem. Eninger (4) in his paper points out that this qualitative statement is a consequence of Spence's theory of discrimination. However, Spence's theory gives no quantitative law.

³ Total error scores do not appear in Eninger's original publication and are no longer known. However, trials-to-criterion scores were reported. Total error scores were estimated from trials-to-criterion scores by using other, comparable data collected by Amsel (1). Dr. Amsel provided detailed results in a personal communication.


TRANSFER OF TRAINING


In order to apply this theory to transfer-of-training experiments in which more than one problem is used, certain assumptions are made. It is assumed that if a cue is conditioned in one problem and appears immediately thereafter as a relevant cue in a new problem, it is still conditioned. Likewise, an adapted cue appearing as an irrelevant cue in a new problem is adapted. However, if a conditioned cue is made irrelevant it is obviously no longer conditioned, since it cannot serve as a predictor of reward. Similarly, it is assumed that if an adapted cue is made relevant in a new problem, it becomes unadapted and available for conditioning.

According to the present definition 
of conditioning, a conditioned cue 
contributes to a correct response. 
Therefore the above assumptions will 
not hold if the relation between a cue 
and a reward is reversed in changing 
the problem. This theory cannot be 
used to analyze reversal learning, and 
is applicable only in cases in which 
relevant cues maintain an unchanging 
significance. 

If two problems are run under the 
same conditions and in the same appa- 
ratus, and differ only in the degree of 
difference between the discriminanda 
(as where one problem is a black- 
white and the other a dark gray-light 
gray discrimination), it is assumed 
that both problems involve the same 
cues; but the greater the difference to
be discriminated, the more cues are
relevant and the fewer are irrelevant.


EMPIRICAL TESTS OF THE TRANSFER- 
OF-TRAINING THEORY 


As Lawrence (7) has pointed out, 
it seems that a difficult discrimination 
is more easily established if the sub- 
jects are first trained on an easy prob- 
lem of the same type than if all 
training is given directly on the diffi- 
cult discrimination. The experimen- 
tal evidence on this point raises the 
question of predicting transfer performance from one problem to another, where the two problems involve the same stimulus dimension but differ in difficulty.

Suppose that problems $s_1$ and $s_2$ both require a discrimination along the same stimulus dimension and differ only in that $s_2$ is more difficult than $s_1$. Let $\theta_1$ be the proportion of relevant cues in problem $s_1$ and $\theta_2$ be the proportion of relevant cues in $s_2$. Suppose that the training schedule involves $n$ trials on problem $s_1$ followed by $j$ trials on problem $s_2$. Then the probability of a correct response on trial $n + j$ is⁴

$$p(n+j) = \frac{\theta_2 + \tfrac{1}{2}(1-\theta_2)^{j}\,[\theta_1 - \theta_2 + (1-\theta_1)^{n}(1 - \theta_1 - \theta_2)]}{\theta_2 + (1-\theta_2)^{j}\,[\theta_1 - \theta_2 + (1-\theta_1)^{n+1}]}. \qquad [10]$$

This theorem can be tested against the results of experiments reported by Lawrence (7). He trained white rats in one brightness discrimination and transferred them to a more difficult problem for further training. A control group, which Lawrence called "HDG," learned the hard test problem without work on any other problem. The performance of this control group is used to estimate $\theta_2$, the proportion of relevant cues in the test problem. The value found was .04.⁵ Since the experimental subjects first worked on the pretraining problem without prior experience, their performance on the first problem serves to estimate $\theta_1$, the proportion of relevant cues in the easier pretraining problem. Lawrence replicated the experiment, having two experimental groups, ATG No. 1 and ATG No. 2, each of which transferred abruptly from an easy pretraining problem to the test problem. Group ATG No. 1 had a very easy problem for which we estimate $\theta_1 = .14$. Group ATG No. 2 had a more difficult problem for which $\theta_1' = .07$.

For group ATG No. 1, $\theta_1 = .14$, $\theta_2 = .04$, and $n = 30$ since thirty trials of pretraining were given. From this information we can compute $p(n + j)$ for all $j$, using equation 10. The predicted transfer performance is compared with observed performance in Table 1. For group ATG No. 2, $\theta_1' = .07$, $\theta_2 = .04$, and $n = 50$ since fifty trials of pretraining were given. Here also, $p(n + j)$ can be computed. Prediction is compared with observed performance in Table 1, from which it can be seen that the predictions are relatively accurate, though performance is higher than predicted.


TABLE 1
PREDICTION OF EASY-TO-HARD TRANSFER IN RATS*

                      Proportion of Correct Responses
Trials of          Group ATG 1              Group ATG 2
Transfer
Training       Observed | Predicted     Observed | Predicted
1-10             .66    |   .63           .81    |   .71
11-20            .70    |   .68           .83    |   .77
21-30            .74    |   .72           .81    |   .81
31-40            .84    |   .78
41-50            .86    |   .83

* Data from Lawrence (7).


⁴ The justification of equation 10 involves no mathematical difficulties. On the first trial of transfer we know the probability that any cue relevant in the second problem is conditioned, since all cues relevant in the second problem were relevant in the first. Similarly, we know the probability that $i_1$ of the $i_2$ irrelevant cues are adapted. The other $i_2 - i_1$ cues are unadapted. Equations 1 and 2 can be applied at this point, and all terms divided by $r_1 + i_1$ $(= r_2 + i_2)$.

⁵ These estimates were made by the unsatisfactory method of fitting equation 7 to group average learning curves. Therefore the results regarding Lawrence's experiment are approximate.
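The transfer prediction can be evaluated directly once $\theta_1$, $\theta_2$, and $n$ are fixed. The sketch below assumes equation 10 in the form $p(n+j) = \{\theta_2 + \tfrac{1}{2}(1-\theta_2)^{j}[\theta_1 - \theta_2 + (1-\theta_1)^{n}(1-\theta_1-\theta_2)]\} / \{\theta_2 + (1-\theta_2)^{j}[\theta_1 - \theta_2 + (1-\theta_1)^{n+1}]\}$; the function names and the ten-trial block averaging are illustrative choices, and the output is not expected to reproduce the tabled predictions exactly, since the $\theta$ values quoted in the text are rounded.

```python
def p_transfer(j: int, n: int, theta1: float, theta2: float) -> float:
    """Equation 10 as read here: probability correct after n trials on the
    easy problem (theta1) and j trials on the harder one (theta2)."""
    a = (1.0 - theta1) ** n      # residual from pretraining
    b = (1.0 - theta2) ** j      # residual from transfer training
    num = theta2 + 0.5 * b * (theta1 - theta2 + a * (1.0 - theta1 - theta2))
    den = theta2 + b * (theta1 - theta2 + a * (1.0 - theta1))
    return num / den


def block_mean(first: int, last: int, n: int, theta1: float, theta2: float) -> float:
    """Mean predicted proportion correct over transfer trials first..last."""
    trials = range(first, last + 1)
    return sum(p_transfer(j, n, theta1, theta2) for j in trials) / len(trials)


if __name__ == "__main__":
    # Parameters quoted for group ATG No. 1: theta1 = .14, theta2 = .04, n = 30.
    for first in (1, 11, 21, 31, 41):
        print(first, first + 9, round(block_mean(first, first + 9, 30, 0.14, 0.04), 3))
```

Setting the pretraining residual to zero in `p_transfer` leaves only the terms in $\theta_2$ and $\theta_1 - \theta_2$, which is the simplification used for subjects pretrained to a high criterion.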

Lawrence also considered the possi- 
bility that a gradual transition from 
easy through successively harder prob- 
lems would result in rapid mastery of 
the difficult problem. He tested this 
proposition by giving another group 
of subjects a series of three pretest 
problems before the final test problem. 
The problems in order of ease of learning were, first, the problem learned by ATG No. 1 with $\theta_1 = .14$, an intermediate problem which was not otherwise used, the difficult pretest problem with $\theta_3 = .07$, and finally the test problem with $\theta_4 = .04$.

To estimate $\theta_2$ in Lawrence's experiment, where problem $s_2$ never was used separately in simple learning, we notice the relation of $\theta$ to differences between discriminanda in apparent foot-candles for problems $s_1$, $s_3$, and $s_4$, whose $\theta$ values are known. We know that if the problems are properly controlled, and the stimulus difference is zero foot-candles, there are no relevant cues and $\theta$ is zero. It was found that this assumption, along with available data, made it possible to write a tentative empirical function relating $\theta$ to the difference between discriminanda in foot-candles. This equation presumably holds only in the case of Lawrence's apparatus, training procedure, subjects, etc. The equation adopted is

$$\theta = .0988 \log_{10}(.4d) \qquad [11]$$

where $d$ is the difference between discriminanda in foot-candles. It is emphasized that this equation has no theoretical significance and is merely expedient. From equation 11 it is possible to determine the $\theta$ value of the intermediate pretraining problem by interpolation. Table 2 gives the data and results of this interpolation.


TABLE 2
THE RELATION OF "DIFFERENCE BETWEEN STIMULI" AND $\theta$ VALUE OF PROBLEM*

Difference Between Discriminanda     Corresponding $\theta$
in Apparent Foot-Candles             Value of Problem
51.7                                 .14
35.2                                 [illegible]**
14.0                                 .07
5.9                                  .04
0.0                                  0†

* Data from Lawrence (7).
** Estimated by interpolation from empirical equation 11.
† Theoretical; see text for explanation.


Ten trials were given on each of the first three problems and fifty trials on the final test problem. Using the $\theta$ values in Table 2 it is possible to predict the test problem performance of subjects who have gone through gradual transition pretraining.⁶ This prediction is compared with observed performance in Table 3. It may be noted that the correspondence between prediction and observation is in this case very close. Again, however, the prediction is consistently a little lower than observed performance.


TABLE 3
PREDICTION OF TRANSFER PERFORMANCE OF RATS AFTER A SERIES OF PRETRAINING PROBLEMS*

Trials Working on        Proportion of Correct Responses
Final Test Problem       Observed      | Predicted
1-10                     [illegible]   | [illegible]
11-20                    .82           | .79
21-30                    .87           | .84
31-40                    .89           | [illegible]
41-50                    .90           | [illegible]

* Data from Lawrence (7).


⁶ The general prediction for transfer through a series of problems which get successively more difficult can be derived by following through and repeating the reasoning in footnote 4. Since the resulting equations are extremely large and can be derived rather easily, they are not given here.
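The empirical relation of equation 11, $\theta = .0988\log_{10}(.4d)$, is simple enough to evaluate at the tabled stimulus differences. The sketch below does so; the function name is an assumption, and the value printed for $d = 35.2$ is only what this formula returns, not a confirmed reading of the corresponding table entry.

```python
import math


def theta_from_difference(d: float) -> float:
    """Equation 11: theta as a function of the stimulus difference d
    (apparent foot-candles). Defined only for d > 0."""
    return 0.0988 * math.log10(0.4 * d)


if __name__ == "__main__":
    for d in (51.7, 35.2, 14.0, 5.9):
        print(d, round(theta_from_difference(d), 3))
```

The formula cannot be evaluated at $d = 0$ and turns negative below $d = 2.5$, which fits the text's remark that the zero entry is theoretical and that the equation is merely expedient.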
NEW DATA


The theory has thus far been tested 
against the behavior of rats. Its 
generality is now tested with college 
students in a simple discrimination 
learning task. 


Subjects and procedure. The subjects in this experiment were 23 students in the elementary psychology course at Stanford University. The S was seated at one end of a table and told that his responses could be either "A" or "B." On each trial S saw a single stimulus, which was a black square on a circular white background. The two squares used on alternate trials differed in size. In problem $s_1$ the squares differed in height by [illegible] in.; in problem $s_2$ they differed by [illegible] in. The mean height of each pair of squares was 3 in. The squares were viewed at a distance of about 6 ft.

For half the Ss in each experimental group, the problem was to say "A" to the smaller square and "B" to the larger one. The other Ss had the converse problem. The S was never told that the problem was a size discrimination. Stimuli were alternated randomly. A rest period was called after each ten trials and S was asked what he thought the correct solution to the problem was, and to outline possible solutions which had occurred to him. This method of questioning is a modification of Prentice's method (8).

Twelve Ss were trained first on problem $s_1$ to a criterion of 15 successive correct responses and then transferred to problem $s_2$ and run to the same criterion. These Ss made up the "Easy-Hard Transfer Group" called EH. The other 11 Ss were trained first on $s_2$ and then transferred to $s_1$. This was the "Hard-Easy Transfer Group" called HE. The two groups were approximately equated for age, sex, and known special visual skills.


Results. Using the pretraining performance of the EH group, the average proportion of relevant cues, $\theta_1$, was estimated at .254 by equation 8. Using the pretraining performance of the HE group, the average proportion of relevant cues in problem $s_2$ was estimated at $\theta_2 = .138$.

The transfer performance of group EH, which first learned the easy and then the hard problem, is predictable by equation 10. Since these subjects worked to a high criterion in pretraining, we can assume that $p(n)$ is negligibly different from one at the end of pretraining. Then by equation 7 we see that $(1 - \theta_1)^{n-1}$ is small, and equation 10 simplifies to

$$p(n+j) = \frac{\theta_2 + \tfrac{1}{2}(1-\theta_2)^{j}(\theta_1 - \theta_2)}{\theta_2 + (1-\theta_2)^{j}(\theta_1 - \theta_2)}. \qquad [12]$$

This theoretical function of $j$ is compared with observed transfer performance in Table 4. It is seen that the correspondence is quite close with a negligible constant error.

This prediction is based on the 
formula which also predicted Law- 
rence's rat data. This confirmation 
suggests that the law can be applied 
to human as well as rat performance 
on this type of task.
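The simplified transfer formula can be averaged over five-trial blocks for comparison with the tabled predictions. The sketch below assumes equation 12 in the form $p(n+j) = [\theta_2 + \tfrac{1}{2}(1-\theta_2)^{j}(\theta_1-\theta_2)] / [\theta_2 + (1-\theta_2)^{j}(\theta_1-\theta_2)]$ with the estimates $\theta_1 = .254$ and $\theta_2 = .138$ given in the text; the function names and the block-averaging step are assumptions about how the predictions were tabulated.

```python
def p_after_mastery(j: int, theta1: float, theta2: float) -> float:
    """Equation 12 as read here: probability correct on transfer trial j
    when the easier problem (theta1) was fully learned beforehand."""
    b = (1.0 - theta2) ** j
    return (theta2 + 0.5 * b * (theta1 - theta2)) / (theta2 + b * (theta1 - theta2))


def block_mean(first: int, last: int, theta1: float, theta2: float) -> float:
    """Mean predicted proportion correct over transfer trials first..last."""
    trials = range(first, last + 1)
    return sum(p_after_mastery(j, theta1, theta2) for j in trials) / len(trials)


if __name__ == "__main__":
    for first in range(1, 35, 5):
        print(first, first + 4, round(block_mean(first, first + 4, 0.254, 0.138), 3))
```

Under these assumptions the first two block means come out within about .005 of the tabled predictions of .82 and .895, which supports this reading of the exponent; later blocks may differ slightly because the $\theta$ estimates are rounded.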

TABLE 4
PREDICTION OF TRANSFER OF TRAINING FROM EASIER TO HARDER PROBLEM IN HUMAN SUBJECTS

Trials after Transfer     Proportion of Correct Responses
to Second Problem         Observed | Predicted
1-5                       .817     | .82
6-10                      .933     | .895
11-15                     .926     | .941
16-20                     .933     | .966
21-25                     .966     | .988
26-30                     .983     | .994
31-35                     1.000    | 1.000


Using the line of reasoning which developed equation 10 we can produce an equation to predict transfer performance from hard to easier problems of the same sort. Certain cues are relevant in the easy problem which were irrelevant in the harder one. These cues cannot be identified in the hard problem. For performance to be perfect in the easier problem all relevant cues must be identified. Therefore, when the subject transfers from the hard to the easier problem we should expect some small number of errors to be made. On the assumption that the hard problem was completely learned in pretraining, the formula for transfer performance on the easy problem is

$$p(n+j) = \ldots \qquad [13]$$

where $\theta_1$ is the proportion of relevant cues in the easy problem and $\theta_2$ is the proportion of relevant cues in the harder problem. The proof of this theorem is similar to that of equation 12 above, and is not given here.

Equation 13 yields the prediction for transfer performance of the HE subjects. In Table 5 the prediction is compared with observed transfer performance.


TABLE 5
PREDICTION OF TRANSFER OF TRAINING FROM HARDER TO EASIER PROBLEM IN HUMAN SUBJECTS

Trials after Transfer     Proportion of Correct Responses
to Second Problem         Observed | Predicted
1-4                       .932     | .883
5-8                       .955     | .960
9-12                      .955     | .984
13-16                     1.000    | .995


Despite the very small frequencies predicted and observed, the prediction is quite accurate. In all, seven errors were made by eleven subjects, whereas a total of eight were expected. This is an average of .64 errors per subject observed, and .73 predicted.
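The right-hand side of the hard-to-easy transfer formula is not legible in this copy. One form that follows from the stated assumptions is sketched below as a reconstruction, not as the author's printed equation. It supposes that cues relevant in both problems are fully conditioned, that irrelevant cues are fully adapted, and that the newly relevant cues (a proportion $\theta_1 - \theta_2$ of all cues) start unconditioned and are conditioned at rate $\theta_1$ per trial, which gives $p = 1 - \tfrac{1}{2}(1-\theta_1)^{j}(\theta_1-\theta_2)/\theta_1$. The function names are hypothetical.

```python
def p_hard_to_easy(j: int, theta_easy: float, theta_hard: float) -> float:
    """Reconstructed hard-to-easy transfer curve (an assumption, see lead-in).

    Only the newly relevant cues can produce errors; each unconditioned one
    contributes equally to correct and incorrect responses.
    """
    unconditioned = (1.0 - theta_easy) ** j
    return 1.0 - 0.5 * unconditioned * (theta_easy - theta_hard) / theta_easy


def block_mean(first: int, last: int, theta_easy: float, theta_hard: float) -> float:
    """Mean predicted proportion correct over transfer trials first..last."""
    trials = range(first, last + 1)
    return sum(p_hard_to_easy(j, theta_easy, theta_hard) for j in trials) / len(trials)


if __name__ == "__main__":
    for first in (1, 5, 9, 13):
        print(first, first + 3, round(block_mean(first, first + 3, 0.254, 0.138), 3))
```

With $\theta_1 = .254$ and $\theta_2 = .138$ this form gives a first-block mean close to the tabled prediction of .883 and runs a few thousandths above the later tabled values, so it should be treated as an approximation to whatever the original equation was.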


DISCUSSION


The definition of a “cue” in terms 
of possible responses is selected be- 
cause the theoretical results do not 
depend critically upon the nature of 
the stimulating agent. While cues 
are thought of as stimulus elements, 


these elements need not be of the
nature of "points of color" or "elementary
tones." If a subject can
learn a consistent response to a certain 
configuration despite changes in its 
constituents, then the configuration 
is by definition a cue separate from 
its constituents. The intention is to 
accept any cue which can be demon- 
strated to be a possible basis for a 
differential response. 

The process of conditioning de- 
scribed in this paper is formally 
similar to the processes of condi- 
tioning of Estes (5) and Bush and 
Mosteller (2,3). In the present 
theory conditioning takes place at 
each trial, not only on "reinforced"
trials. In earlier theories condition- 
ing is said to occur only on such rein- 
forced trials. In two-choice discrimi- 
nation the incorrect response has a 
high initial probability (one-half) be- 
cause of the nature of the physical 
situation and the way of recording 
responses. Therefore, a theory of 
two-choice learning must account for 
the consistent weakening of such re- 
sponses through consistent nonrein- 
forcement. 

The notion of adaptation used here 
is formally analogous to the operation 
of Bush and Mosteller's Discrimination
Operator "D" (3). However,
whereas Bush and Mosteller’s operator 
is applied only on trials in which the 
reward condition is reversed for a cue, 
the present theory indicates that this 
process takes place each trial. In 
addition, while the Discrimination 
Operator and the process of adapta- 
tion are both exponential in form, 
Bush and Mosteller introduce a new 
exponential constant k for this pur- 
pose and the present theory uses the 
conditioning constant $\theta$.

The major point differentiating the 
present theory from similar earlier 
theories is the use of the strong simplifying
assumption identifying the exponential
constant $\theta$ with the proportion
of relevant cues. This assumption
may appear intuitively unlikely,
but if it should be shown by
further experiment to be tenable, the 
predictive power of discrimination 
learning theory is enhanced. There 
seems to be no reason for abandoning 
so useful an assumption unless experi- 
mental results require it. 


SUMMARY 


A theory of two-choice discrimina- 
tion learning has been presented. 
The theory is formally similar to 
earlier theories of Estes (5) and Bush 
and Mosteller (3) but differs some- 
what in basic concepts and uses a 
new simplifying assumption. 

From this theory three empirical 
laws are derived: one dealing with the 
combination of relevant cues, and two 
dealing with a special type of transfer 
of training. These laws permitted 
quantitative predictions of the be- 
havior of four groups of rats and two 
groups of human subjects. Five of
these six predictions were quite accurate,
and the sixth was within the
range of reasonable sampling deviation.
THE ROLE OF OBSERVING RESPONSES IN
DISCRIMINATION LEARNING¹


PART I 


BY L. BENJAMIN WYCKOFF, JR. 


University of Wisconsin 


Theorists in the area of discrimina- 
tion learning have often had occasion 
to refer to a set or predisposition of S
to learn differential responses to a par- 
ticular pair of stimuli. Such a pre- 
disposition has often been attributed 
to some reaction of S such as an at- 
tending response, orienting response, 
perceiving response, sensory organiza- 
tional activity, etc. To implement 
the discussion of the role of such re- 
actions in discrimination learning we 
shall adopt the term "observing response"
($R_o$) to refer to any response
which results in exposure to the pair
of discriminative stimuli involved.
The probability of occurrence of an
observing response will be denoted by
$p_o$. These responses are to be
distinguished from the responses upon
which reinforcement is based; that is, 
running, turning right or left, lever 
pressing, etc., which, for convenience, 
we shall term “effective responses." 

Spence (19) has proposed a theory 
of discrimination which is specifically 
intended to deal with situations where 
no observing response is required of S, that is to say, to situations in which S is certain to be exposed to the discriminative stimuli on each trial or prior to each effective response ($p_o = 1$). The fact that in some discrimination experiments this condition has not been satisfied has become an issue in the literature, largely because it became necessary to delimit clearly the situations to which Spence's theory is intended to apply.


¹ This paper is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, in the Department of Psychology, Indiana University. The writer wishes to express his appreciation to Dr. C. Y. Burke for his invaluable guidance and stimulation.


Spence's theory of discrimination states 
that stimulus-response connections are 
strengthened or weakened during discrimi- 
nation training in essentially the same way 
as these changes would occur during condi- 
tioning or extinction. When a response is 
reinforced the connections between it and 
all aspects of the stimulus situation impinging
on S at the time the response occurred
will be strengthened. These connections
will be weakened when the response
is not reinforced. Certain implications of 
this theory were questioned by Krechevsky 
(11) and other theorists, and became the 
subject matter of the “continuity-disconti- 
nuity” controversy. This material has been 
reviewed a number of times (2, 5) and 
need not be repeated in detail here. One 
aspect of the controversy is pertinent to 
the present discussion. Krechevsky (12) 
presented experimental findings which indicated
that rats learned nothing with respect
to two stimulus patterns during the first 20 
trials of a discrimination experiment even 
though they were systematically reinforced 
for approaching a particular pattern dur- 
ing this interval. Failure to learn was es- 
tablished by showing a lack of interference 
when Ss were tested on a reversed dis- 
crimination. These findings were in appar- 
ent disagreement with the data obtained by 
McCulloch and Pratt (13) in a similar ex- 
periment in which differing weights were 
used as discriminative stimuli. Here in- 
terference was obtained, indicating that 
some cumulative learning had occurred in 
the early portion of the experiment. 

In interpreting these results, Spence (20, p. 277) argued that the stimuli (patterns) used by Krechevsky were not sufficiently conspicuous to provide a legitimate test of his theory. He suggested that Ss had not learned to orient toward the stimuli within the first 20 trials. He points out that in such cases, ". . . the animal must learn to orient and fixate its head and eyes so as to receive the critical stimuli." He then suggests a way in which this learning may occur. "These reactions are learned . . . because they are followed within a short temporal interval by the final goal response."

Reprinted with permission.

This interpretation was put to an experi- 
mental test by Ehrenfreund (5). In his 
experiment the likelihood of S’s receiving 
the critical stimuli was manipulated by 
changing the position of the stimuli (up- 
right and inverted triangles) with respect 
to the landing platform of a jumping stand. 
The design of the experiment was essen- 
tially the same as Krechevsky’s. The re- 
sults conform to Spence’s interpretation. 
When the stimuli were placed relatively 
high, no learning occurred within the first 
40 trials, whereas when the stimuli were 
placed closer to the landing platform learn- 
ing did occur. Learning was again meas- 
ured in terms of interference in the learn- 
ing of a subsequent reversed discrimination. 


The analysis of discrimination situ- 
ations in which some observing re- 
sponse is required is of interest for 
several reasons. First, discrimination 
learning in situations other than labor- 
atory experiments, such as human 
learning in the course of everyday
events, is largely of this kind. Sec- 
ondly, even in the most closely con- 
trolled laboratory experiments it is 
seldom, if ever, possible to say with 
certainty that S is exposed to the dis- 
criminative stimuli prior to each effec- 
tive response. In the case of pattern 
discriminations it has been demon- 
strated by Ehrenfreund (5) that rela- 
tively small differences in the position 
of the discriminative stimuli will 
affect discrimination learning, indicating
that relatively precise fixation of
the stimulus is required. 

In the present paper an attempt will be made to develop a more extensive theory of discrimination which will include situations in which some observing response (hereafter referred to as $R_o$) is required before S is exposed to the discriminative stimuli. An example of such a situation would be an experiment in which stimulus cards were placed overhead. In this case the response of raising the head would be the $R_o$.

If we accept the notion that changes
in $p_o$ can be accounted for within the
framework of reinforcement learning
theory, it should be possible to devise
a theory of discrimination which will
include those cases where some $R_o$ is
necessary. The purpose of this paper
is to outline such a theory. We shall 
see that by analyzing discrimination 
learning in this way it will be possible 
to account for stimulus generalization 
and also changes in generalization dur- 
ing discrimination learning without 
postulating any direct interaction between
stimuli. Several hypotheses
will be derived from this theory which
have been tested in an experiment by
the author presented in detail elsewhere
(22). Finally we shall outline
a way in which the present theory can
be integrated with existing quantita- 
tive theories of conditioning and ex- 
tinction to form a quantitative theory 
of discrimination. 

To simplify this discussion let us 
consider a hypothetical experiment 
using a situation similar to that used 
by Wilcoxon, Hays, and Hull (21), 
and later used by Hull (10) for a dis- 
crimination experiment. In this ex- 
periment a rat was placed in a small 
compartment with a single exit 
through a door into a goal compart- 
ment. A measure of the latency of 
the response of running through this 
door was obtained. The discrimina- 
tive stimuli consisted of a black or a 
white door, either one of which was 
present on each trial. During discrimination
training the running response
was reinforced with food when
one color was present, whereas rein- 
forcement was withheld when the 
other color was present. Each stimu- 
lus was present on an average of 50 
per cent of the trials. 

For purposes of the present discussion let us consider a slightly different situation in which the discriminative stimuli are placed overhead rather than directly in front of S. In this case an observing response, raising the head, will be necessary if S is to be exposed to the discriminative stimuli. On each trial, when S is placed in the apparatus, there will be a certain probability that the $R_o$ of looking up will occur. When $R_o$ does occur S will be exposed either to a black or a white card. When the $R_o$ fails to occur S will not be exposed to either card, but rather to a neutral population of stimuli (walls, floor, etc.). Note that in this situation S does not improve its chances of ultimate reinforcement by making the $R_o$. The food is placed in the goal compartment whenever the white card is present whether S actually looks up or not. In a sense then, S gains only information by making the $R_o$.

We are now in a position to examine the relation between observing responses and stimulus generalization. In general it is apparent that if $p_o$ has a low value, S will seldom be exposed to the discriminative stimuli (the black and white cards). S, therefore, will have minimum opportunity to learn discrimination or to manifest any discrimination already learned. On the other hand, if $p_o$ has a high value, the opportunity to learn or manifest discrimination will be large.

Stimulus generalization between two stimuli is usually defined either in terms of S's tendency to respond similarly to the two stimuli, or in terms of failure to learn differential responses readily. Thus we can see that stimulus generalization will decrease as $p_o$ increases.

If we assume that $p_o$ changes as a result of learning processes we can see that these changes would give rise to changes in generalization between the stimuli involved. More specifically, if we assume that $p_o$ will increase during discrimination learning (differential reinforcement), generalization between the discriminative stimuli will decrease. Similarly, we might assume that $p_o$ will decrease if we introduce a procedure in which the subject is reinforced equally often in the presence of either stimulus (non-differential reinforcement). This decrease in $p_o$ would give rise to an increase in generalization between the stimuli.

In the case of the hypothetical experiment suggested above, generalization will be shown in a "crossover" effect between positive and negative trials. Reinforcements on positive trials (positive stimulus card present but not necessarily observed) will tend to strengthen the effective response on negative trials, while unreinforced responses on negative trials will tend to weaken the effective response on positive trials. If S's tendency to look up increases during differential reinforcement, this "crossover" effect will decrease. If during non-differential reinforcement the tendency to look up decreases, the "crossover" effect will increase.

It should be emphasized that these statements regarding increases and decreases in $p_o$ are, at this point, assumptions which may or may not be true in a particular experimental situation. We shall present below experimental findings which suggest that these assumptions are quite generally true.
In the above discussion we have considered the effects of $R_o$ on discrimination and generalization. At this point we turn our attention to the problem of accounting for changes in $p_o$ within the framework of reinforcement learning theory. Our problem will be to identify possible reinforcing conditions which may account for increases in $p_o$ during differential reinforcement.

First we note that, by definition, 
the observing response results in ex- 
posure to a pair of discriminative 
stimuli. If exposure to these stimuli 
is in some way reinforcing, we shall 
expect $p_o$ to increase or remain high.
The problem at hand is to show how 
exposure to discriminative stimuli 
may have a reinforcing effect under 
the condition of differential reinforce- 
ment, while the same stimuli do not 
have this effect under the condition of 
non-differential reinforcement. Re- 
inforcement theory provides two ways 
of accounting for this reinforcing 
effect. 

The first method is the mechanism
suggested by Spence when he states
that observing responses are learned
"because they are followed within a
short temporal interval by the final
goal response" (19). This mechanism
will operate in experiments such as a
"jumping stand" experiment, in
which exposure to discriminative
stimuli may serve to increase the
probability of prompt reinforcement;
that is to say, the probability of the
"correct" jump may be increased. Spence
offered this suggestion in relation to a
jumping stand experiment.

The second method of accounting 
for the reinforcing effect is by appeal 
to the principles of secondary rein- 
forcement. Here we suggest that the 
discriminative stimuli themselves take 
on secondary reinforcing value during 
the course of discrimination learning.

It has been demonstrated that an originally
neutral stimulus which accompanies
reinforcement may acquire secondary re- 
inforcing properties. That is, it may serve 
to strengthen a response upon which it is 
made contingent. Skinner (18, p. 246) 
has demonstrated that whenever a stimulus 
becomes a discriminative stimulus for some 
response in a chain leading ultimately to 
reinforcement, this stimulus will serve as a 
secondary reinforcing stimulus. The con- 
ditions necessary for the formation of sec- 
ondary reinforcing properties are further 
considered by Notterman (16), Schoenfeld 
et al. (17) and Dinsmoor (4). They point 
out that in all cases where secondary rein- 
forcement has been demonstrated, the con- 
ditions were also appropriate for the estab- 
lishment of the stimulus in question as a 
discriminative stimulus. They suggest that 
this may be a necessary (as well as sufficient)
condition for the establishment of
secondary reinforcing properties. In the 
present formulation it is apparent that the 
positive stimulus is presented in the ap- 
propriate temporal position to become both 
a discriminative stimulus (for the effec- 
tive response) and a secondary reinforcing 
stimulus (for the observing response). 


This mechanism may operate in
any situation whatever where an $R_o$
is involved, since it is a defining
characteristic of the $R_o$ that it
leads to exposure to discriminative
stimuli. Specifically, it should apply
to the hypothetical experiment
suggested above. Here the effective
response (running) will always be
reinforced when S is exposed to the
white card. Hence the white card could
be expected to acquire secondary
reinforcing value. It is not sufficient
to show simply that the positive
stimulus will acquire secondary
reinforcing value. We must also
consider two other factors. First,
$R_o$ results in exposure to the
positive stimulus only 50 per cent of
the time. It results in exposure to the
negative stimulus the other 50 per
cent. Second, the running response is
reinforced sometimes when S is exposed
to the neutral stimulus population,
since, on positive trials, the running
response is reinforced even though S
does not look up. The effective
response is reinforced most
consistently when S is exposed to the
positive stimulus. Therefore, it is
still plausible to postulate that the
intermittent exposure to the positive
and negative stimuli will have a net
reinforcing effect on $R_o$.

It is true of both of these mechanisms
that, before any increase in $p_o$
can be expected to occur, S must learn
differential effective responses; that
is to say, S must learn to respond
differently to the two discriminative
stimuli. In the case of the "jumping
stand" experiment, if S does not have
differential jumping tendencies toward
the discriminative stimuli, the
probability of reinforcement will
always be 50 per cent, and will not be
improved by the occurrence of $R_o$.

When we apply the secondary rein- 
forcement principle we can see that 
the positive stimulus must appear in 
the proper temporal relation to rein- 
forcement a number of times before 
this stimulus will acquire secondary 
reinforcing properties. In terms of 
Notterman, Schoenfeld, and Dins- 
moor's interpretation it will be nec- 
essary for S to learn differential effect- 
ive responses to the discriminative 
stimuli before secondary reinforcing 
properties are acquired by these 
stimuli. 

In view of these considerations we 
introduce the following general hy- 
pothesis: Exposure to discriminative 
stimuli will have a reinforcing effect 
on the observing response to the ex- 
tent that S has learned to respond 
differently to the two discriminative 
stimuli. 

Hereafter we shall refer to the
magnitude of the difference between
S's tendencies to respond to the two
discriminative stimuli as the "degree
of discrimination."

Earlier it was pointed out that the
probability of occurrence of $R_o$ is
one of the factors determining the rate
of formation of discrimination.
According to the present hypothesis the
opposite relationship is also true. The
resulting picture is one of a circular
interrelationship, in which $R_o$
affects the formation of discrimination
because of its effect on exposure to
discriminative stimuli, while the
degree of discrimination affects $R_o$
through another mechanism involving
either secondary reinforcement or
changes in the probability of
reinforcement.

We now present four propositions 
which are implied by this general 
hypothesis. The hypothesis was 
formulated partly on the basis of 
experimental evidence already avail- 
able, which suggested that these 
propositions were true (22). At present
we shall consider them as specific
hypotheses. The first two of these 
have already been introduced as as- 
sumptions. 


1. $p_o$ will increase (or remain high)
under conditions of differential
reinforcement.

2. $p_o$ will decrease (or remain low)
under conditions of non-differential
reinforcement.


It is apparent that these hypotheses 
are consistent with the general hy- 
pothesis since the degree of discrimi- 
nation will tend to increase (or remain 
high) under differential reinforcement, 
while it will tend to decrease (or 
remain low) under nondifferential 
reinforcement. In other words, S 
will learn to respond differently to the 
two stimuli under differential rein- 
forcement, but will learn to respond 
in the same way to them under
non-differential reinforcement.
Additional hypotheses of interest can
be derived from this general hypothesis.

3. When a well established
discrimination is reversed, $p_o$ will
decrease temporarily and then return to
a high value.

We shall expect this change in $p_o$
because, following a reversal, the
degree of discrimination will decrease
as the original discrimination
vanishes. It will then increase as the
new discrimination is formed.

4. If at some point in an experiment
the degree of discrimination is low and
at the same time $p_o$ is low (but
greater than zero), we shall expect the
formation of discrimination to be
retarded for some interval, but finally
to occur quite rapidly.


This hypothesis arises from the fact
that increases in the degree of
discrimination, and increases in $p_o$,
are dependent upon each other. Early in
the process S will be exposed to the
discriminative stimuli only a small
proportion of the time and hence the
degree of discrimination cannot
increase rapidly. At the same time
$p_o$ will not increase because of the
low degree of discrimination. Then, as
the degree of discrimination becomes
sufficiently great to bring about an
increase in $p_o$, the entire learning
process will be accelerated.


Krechevsky (11) presents data obtained 
in discrimination experiments in a jumping 
stand situation which correspond in some 
respects to the predictions of the pres- 
ent formulation. Curves for individual Ss 
show relatively abrupt discrimination for- 
mation. In general the curves also show 
a slight improvement in discrimination 
prior to the abrupt change. A curve pre- 
sented for discrimination reversal shows a 
rapid decrease in the degree of discrimina- 
tion to a chance level, followed by an in- 
terval during which improvement was much 
less rapid. Finally the process accelerated 
as the reversed discrimination formed.
Krechevsky also noted that during the
interval while S was responding approximately
according to chance with respect to
the discriminative stimuli, he showed a 
strong position preference. These findings 
are in complete agreement with hypotheses 
3 and 4 in the present formulation. 


The four hypotheses presented so
far were tested in an experiment by
the writer (22) which is presented in
detail elsewhere. In this experiment
direct measures of an $R_o$ were
obtained during differential
reinforcement, non-differential
reinforcement, and during
discrimination reversal. Pigeons
were used in a Skinner-box situation
in which the effective response was
striking a single translucent key.
The discriminative stimuli were
colored lights (red and green) projected
on the back of the key one at a time.
The colored lights were withheld and
the key was lighted white until the
$R_o$ occurred. The $R_o$ consisted of
stepping on a pedal on the floor of the
compartment. The reasons for using
this response as an observing response
are discussed in detail elsewhere (22).
Here it will suffice to say that this
response falls within our definition of
an observing response in that it
resulted in exposure to the
discriminative stimuli. As in the case
of the hypothetical experiment
discussed above, the observing response
had no effect on the probability of
reinforcement at any given moment.

All of the above hypotheses were
supported by the results of this
experiment. Concerning the first three
hypotheses, $p_o$ was higher under
differential reinforcement than under
non-differential reinforcement. When Ss
were shifted from differential to
non-differential reinforcement a marked
decrease in $p_o$ occurred. All of these
differences were significant at a 5 per
cent level of confidence or better.

The fourth hypothesis does not apply
unless at some point in the experiment
the degree of discrimination and
$p_o$ are both low. This condition was
not satisfied consistently since the
operant (or base) level of the pedal
response turned out to be relatively
high for Ss. However, in several
cases this condition was satisfied and
in these cases the results conformed
to the hypothesis.

We can now illustrate some ways 
in which this theory might be useful 
in interpreting behavior in other ex- 
periments. 


1. If this theory is applied to situ- 
ations in which more than one pair of 
discriminative stimuli is involved we 
can make some predictions regarding 
changes in the readiness of S to form 
discriminations based on some par- 
ticular pair of stimuli. 

2. It has been demonstrated that
when a discrimination is reversed
repeatedly Ss tend to learn the reversed
discrimination more and more rapidly
(15, 8). According to the present
theory, during discrimination reversal
the observing response is partially
extinguished and reconditioned.
Thus, during repeated reversals, the
$R_o$ is, in effect, reinforced
intermittently. Studies of intermittent
reinforcement have indicated that when
a response is intermittently
extinguished and reconditioned, the
strength of the response tends to
attain a relatively constant high value
(18). On the first reversal $p_o$ might
drop to a low value, and recover
slowly, but with repeated reversals
we would expect this drop to become
less prominent, and finally, $p_o$ would
remain high throughout the reversal.
It is apparent that if $p_o$ remained
high, a reversed discrimination would
be learned more rapidly than otherwise.


In the preceding discussion we have
examined some of the ways in which
discrimination learning may be affected
when some observing response
is required of S. We shall now derive 
some quantitative statements to sup- 
plement the above analysis. We 
shall attempt to set down the relation- 
ships involved in such a way that the 
present theory can be readily inte- 
grated into existing quantitative the- 
ories of learning such as Hull's (9), 
Estes' (6) or Bush and Mosteller's 
(3). The potential applications of 
this development could proceed along 
two different lines. 

First, we could attempt to state the
relationships between observing
responses and measurable aspects of the
effective responses in such a way that
$p_o$ could be estimated in situations
where direct measurement of $R_o$ is
not feasible. This might be the case,
for example, if the $R_o$ involved
focusing of the eye. If we apply the
present development in this way, $p_o$
would become an intervening variable,
which could be used to account for and
predict behavior in situations where
(1) the apparent generalization between
stimuli changes, or (2) the ease of
formation of discrimination changes as
a function of training. Berlyne (1)
suggests that "attention" be treated in
a similar way.

Secondly, we could predict
discrimination learning functions by
adopting some set of assumptions
regarding the component learning
processes involved. These assumptions
could be adopted from some existing
theory which treats the simpler
processes of conditioning and
extinction. The main obstacle to this
endeavor at the moment is the absence
of any quantitative function for
predicting changes in $p_o$. However,
we shall be able to set down the
relationships involved in such a way
that any acceptable function can
immediately be inserted.

QUANTITATIVE ANALYSIS


For purposes of this analysis let us
return to consideration of the
hypothetical experiment discussed above.
There it was pointed out that we must
take into consideration three different
stimulus populations which may affect
S's behavior. We shall adopt the
following notation to represent these
stimuli. Let $S_1$ represent the
stimulus population to which S is
exposed on trials when the $R_o$ occurs
and when the positive stimulus card
(white) is present, $S_2$ represent the
stimulus population on trials when the
$R_o$ occurs and when the negative
stimulus card (black) is present, and
$S_3$ the stimulus population to which
S is exposed when the $R_o$ fails to
occur.

In this analysis we shall use the
symbol $p$ to represent the probability
of occurrence of the effective response
at any given moment during a trial.
This variable can be related to the
variable of response latency as follows.
Estes (6) has shown that if a response
can be expected to occur with a given
probability at any moment during a
trial, the mean latency of the response
will be proportional to the reciprocal
of the probability; that is to say,
$L = k/p$, where $L$ is the mean latency,
$p$ the probability, and $k$ a constant
of proportionality which will depend on
the units of measurement used. In
the present case we must consider the
probability of occurrence of the
effective response for each of three
stimulus populations. Let us adopt the
symbols $p_1$, $p_2$, and $p_3$ to
represent the probability of occurrence
of the effective response when S is
exposed to $S_1$, $S_2$, and $S_3$,
respectively. We shall also wish to
refer to the net probability of
occurrence of the effective response on
a given trial, taking into account that
S may be exposed to different stimuli
during the trial depending on the
occurrence or non-occurrence of the
$R_o$. We shall use the symbols $p_+$
and $p_-$ to represent the net
probability on trials when the
positive or negative stimuli are
present.
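
The reciprocal relation between mean latency and momentary probability can be checked numerically. The Python sketch below is only an illustration: it assumes that a trial consists of discrete, independent moments (the text does not specify a time grid), and the function name `mean_latency` and the truncation tolerance `tol` are hypothetical choices made here.

```python
def mean_latency(p, k=1.0, tol=1e-15):
    """Mean waiting time, in moments scaled by k, until the first response.

    Assumes the response occurs independently with probability p at each
    discrete moment (an illustrative assumption, not part of the text).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    total = 0.0
    no_response_yet = 1.0  # probability that no response has occurred so far
    moment = 0
    while no_response_yet > tol:
        moment += 1
        total += moment * no_response_yet * p  # first response at this moment
        no_response_yet *= 1.0 - p
    return k * total


if __name__ == "__main__":
    for p in (0.5, 0.25, 0.1):
        # Each printed latency should be close to k / p with k = 1.
        print(p, round(mean_latency(p), 6))
```

Under these assumptions the summed series agrees with k/p to within the truncation tolerance, which is the sense in which L = k/p is used in the analysis.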

To summarize: 


$S_1$ = the population of stimuli to
which S is exposed if (1) the
positive stimulus is present and
(2) the $R_o$ occurs.

$S_2$ = the population of stimuli to
which S is exposed if (1) the
negative stimulus is present
and (2) the $R_o$ occurs.

$S_3$ = the population of stimuli to
which S is exposed if the observing
response fails to occur.

$p$ = the probability that the
effective response will occur at any
given moment during a trial
($= k/L$).

$p_1$ = the value of $p$ when S is
exposed to $S_1$.

$p_2$ = the value of $p$ when S is
exposed to $S_2$.

$p_3$ = the value of $p$ when S is
exposed to $S_3$.

$p_+$ = the net value of $p$ for a trial
on which the positive stimulus is
present.

$p_-$ = the net value of $p$ for a trial
on which the negative stimulus is
present.

$p_o$ = the probability of occurrence of
$R_o$ at any given moment during
a trial.


We shall now express certain
functional relationships among these
variables. First we shall express $p_+$
and $p_-$ as two functions of the
variables $p_1$, $p_2$, $p_3$, and
$p_o$. $p_+$ and $p_-$ are
variables which can be evaluated
from experimental measures, such as
latency of the effective response,
without reference to direct measures of
$R_o$. They correspond to the measures
of response tendency usually obtained
in discrimination experiments.
However, in the present framework
$p_+$ and $p_-$ are assumed to be the
net result of the operation of the
variables $p_1$, $p_2$, $p_3$, and
$p_o$. Our task will be to
express this dependence as a pair of
functional relationships. This can be
done as follows.

Consider a selected moment during
a positive trial. At this moment S
will be exposed to either $S_1$, with a
probability of $p_o$, or to $S_3$, with
a probability of $(1 - p_o)$. If S is
exposed to $S_1$ he will make the
effective response with a probability
of $p_1$. If the effective response and
$R_o$ are independent of each other the
probability that both $R_o$ and the
effective response will occur will be
the product $p_o p_1$. If S is exposed
to $S_3$ he will make the effective
response with a probability of $p_3$,
and the probability that both will
occur will be the product
$(1 - p_o) p_3$. The total probability
that the effective response will
occur at this moment will be the sum
of these products. Thus:


$$p_+ = p_o p_1 + (1 - p_o)\,p_3. \tag{1}$$


By exactly parallel reasoning with
respect to a selected moment during
a negative trial we obtain:


$$p_- = p_o p_2 + (1 - p_o)\,p_3. \tag{2}$$
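
To make Equations 1 and 2 concrete, the following minimal Python sketch evaluates the two net probabilities for given component values. The function name `net_probabilities` and the numerical values are illustrative assumptions, not data from any experiment.

```python
def net_probabilities(p_o, p1, p2, p3):
    """Return (p_plus, p_minus) computed from Equations 1 and 2."""
    for name, value in (("p_o", p_o), ("p1", p1), ("p2", p2), ("p3", p3)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    p_plus = p_o * p1 + (1.0 - p_o) * p3   # Equation 1
    p_minus = p_o * p2 + (1.0 - p_o) * p3  # Equation 2
    return p_plus, p_minus


if __name__ == "__main__":
    # Illustrative values only: S responds very differently to the two
    # cards once they are observed, and at an intermediate level otherwise.
    p1, p2, p3 = 0.9, 0.1, 0.5
    for p_o in (0.2, 0.8):
        p_plus, p_minus = net_probabilities(p_o, p1, p2, p3)
        print(p_o, round(p_plus, 3), round(p_minus, 3))
```

Subtracting Equation 2 from Equation 1 gives p_+ - p_- = p_o (p_1 - p_2), so for fixed p_1 and p_2 the observable difference between positive and negative trials shrinks in proportion as p_o falls. This is the sense in which a low p_o looks like increased generalization.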


The next step will be to derive
expressions for predicting the values of
$p_1$, $p_2$, and $p_3$. The
reinforcement contingencies for the
effective response in the presence of
$S_1$, $S_2$, and $S_3$ can
be readily ascertained. It will be
possible to predict changes in the
values of $p_1$, $p_2$, and $p_3$ on the
basis of learning functions for the
simpler processes of conditioning and
extinction if we assume that learning
with respect to each of these stimuli
proceeds independently of learning with
respect to the others. This assumption
implies that interaction between
stimuli will have a negligible effect.
However, in making this assumption
we do not forfeit the ability to handle
stimulus generalization within the
present framework, since, as we have
already pointed out, stimulus
generalization can be accounted for
without postulating any such direct
interaction.

In the present paper we do not 
adopt a particular set of functions for 
conditioning and extinction, but at- 
tempt to set down the relationships 
in such a way that any acceptable set 
of functions can be immediately in- 
serted. 

The assumption of "negligible direct
interaction" implies that changes
in the probability of occurrence of
the effective response with respect to
a particular stimulus population $S_i$
($i$ = 1, 2, or 3) will occur only during
the time in which S is exposed to $S_i$,
and that the rate of change with
respect to time will depend on:


1. Whether or not the effective
response is reinforced.
2. The value of $p_i$ at the time.


If we let $r_i$ represent the proportion
of the time during which S is exposed
to $S_i$, the rate of change of $p_i$
can be approximated by two functions as
follows:


$$dp_i/dt = r_i f_c(p_i) \tag{3}$$


if the effective response is reinforced,
and


$$dp_i/dt = r_i f_e(p_i) \tag{4}$$


if the effective response is not
reinforced.

The functions $f_c$ and $f_e$ represent
any acceptable set of analytic
functions which approximate the rate of
change of probability of occurrence
of an effective response during
conditioning and extinction,
respectively. It will be noted that if
we assign a value of 1 to $r_i$ we will
obtain expressions for simple cases of
conditioning or extinction. In the
present model the values of $r_i$ can be
expressed as functions of $p_o$ as
follows. The positive and negative
stimuli are each to be present 50 per
cent of the time. During this time the
subject will be exposed to $S_1$ or
$S_2$ with a probability of $p_o$.
Hence:


$$r_1 = r_2 = .5\,p_o.$$


S will be exposed to $S_3$ with a
probability of $(1 - p_o)$. Hence:


$$r_3 = (1 - p_o).$$
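
As a quick consistency check on these proportions (a worked line added here for illustration, using r_1 = r_2 = .5 p_o and r_3 = 1 - p_o as stated above), the three exposure times should together account for all of the time:

```latex
\[
r_1 + r_2 + r_3 \;=\; .5\,p_o + .5\,p_o + (1 - p_o) \;=\; 1 .
\]
```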


We also know that all effective
responses in the presence of $S_1$ are
reinforced, effective responses in the
presence of $S_2$ are not reinforced,
and effective responses in the presence
of $S_3$ are reinforced an average of
one-half of the time. Using the above
values of $r_i$ and appropriate
functions for reinforced and
non-reinforced responses we obtain:


$$dp_1/dt = .5\,p_o f_c(p_1) \tag{5}$$

$$dp_2/dt = .5\,p_o f_e(p_2) \tag{6}$$

$$dp_3/dt = .5\,(1 - p_o) f_c(p_3) + .5\,(1 - p_o) f_e(p_3). \tag{7}$$
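
Equations 5, 6, and 7 can be integrated numerically once specific functions are chosen. The sketch below is a minimal illustration under stated assumptions: the conditioning and extinction functions are taken to be simple linear forms (loosely in the spirit of linear-operator learning models, but not prescribed by the present analysis), the rate constant `theta` is arbitrary, p_o is supplied as a hypothetical function of time, and plain Euler stepping is used.

```python
def f_c(p, theta=0.1):
    """Assumed conditioning function: linear approach toward 1."""
    return theta * (1.0 - p)


def f_e(p, theta=0.1):
    """Assumed extinction function: linear decay toward 0."""
    return -theta * p


def integrate(p_o_of_t, p_init=(0.5, 0.5, 0.5), t_end=200.0, dt=0.01):
    """Euler integration of Equations 5, 6, and 7 for a supplied p_o(t).

    p_o_of_t is a hypothetical callable giving the probability of the
    observing response at time t; returns (p1, p2, p3) at t_end.
    """
    p1, p2, p3 = p_init
    steps = int(round(t_end / dt))
    for step in range(steps):
        p_o = p_o_of_t(step * dt)
        dp1 = 0.5 * p_o * f_c(p1)                      # Equation 5
        dp2 = 0.5 * p_o * f_e(p2)                      # Equation 6
        dp3 = 0.5 * (1.0 - p_o) * (f_c(p3) + f_e(p3))  # Equation 7
        p1 += dt * dp1
        p2 += dt * dp2
        p3 += dt * dp3
    return p1, p2, p3


if __name__ == "__main__":
    # Constant p_o = 0.5, chosen purely for illustration.
    print(integrate(lambda t: 0.5))
```

With these assumed linear functions, p_1 moves toward 1, p_2 toward 0, and p_3 toward .5, each at a rate scaled by the corresponding exposure proportion. Any other acceptable pair of functions could be substituted for `f_c` and `f_e` without changing the structure of the loop.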


We can now outline the steps which
would be necessary to predict $p_+$ and
$p_-$ (measurable aspects of effective
responses) if we can predict the values
of $p_o$ as a function of time. Such a
function could be derived empirically
or through some theoretical statement
regarding the factors which
bring about changes in $p_o$. If $p_o$
can be expressed as a function of time
we can rewrite equations 5, 6, and 7 to
obtain expressions involving only
$dp_i/dt$, $p_i$, and $t$. If these
differential equations can be solved we
will obtain $p_i = f(t)$. Thus we can
obtain values of $p_1$, $p_2$, $p_3$,
and $p_o$ for any point in time. These
values can be substituted in equations
1 and 2 to give the desired prediction
of $p_+$ and $p_-$.
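
The prediction procedure just outlined can be illustrated with a toy simulation. The sketch below is speculative and should be read only as an illustration: the text supplies no quantitative function for changes in p_o, so the growth rule used here (the observing response is strengthened only when it occurs, and in proportion to the degree of discrimination p_1 - p_2) is a hypothetical stand-in, as are the linear conditioning and extinction functions and every parameter value.

```python
def simulate(p_o0=0.05, theta=0.2, phi=0.2, t_end=400.0, dt=0.05):
    """Couple Equations 5 and 6 to a HYPOTHETICAL growth rule for p_o.

    Returns the observed discrimination (p_plus - p_minus) at each
    Euler step, starting at t = 0.
    """
    p1 = p2 = 0.5
    p_o = p_o0
    observed = []
    for _ in range(int(round(t_end / dt))):
        degree = p1 - p2                      # "degree of discrimination"
        observed.append(p_o * degree)         # p_plus - p_minus (Eqs. 1, 2)
        dp1 = 0.5 * p_o * theta * (1.0 - p1)  # Eq. 5 with an assumed f_c
        dp2 = -0.5 * p_o * theta * p2         # Eq. 6 with an assumed f_e
        # Hypothetical rule, NOT taken from the text: growth of p_o is
        # proportional to p_o itself and to the degree of discrimination.
        dp_o = phi * degree * p_o * (1.0 - p_o)
        p1 += dt * dp1
        p2 += dt * dp2
        p_o += dt * dp_o
    return observed


if __name__ == "__main__":
    curve = simulate()
    # Print the observed discrimination at a few widely spaced steps.
    for step in range(0, len(curve), 1000):
        print(step, round(curve[step], 4))
```

Under these assumptions the observed discrimination starts at zero, changes very slowly while both p_o and the degree of discrimination are low, and then accelerates, which is qualitatively the pattern described in the fourth hypothesis. A different rule for p_o would give a different curve, so the shape should not be read as a prediction of the theory itself.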
On the other hand, if we wish to
estimate the values of $p_o$ from known
values of $p_+$ and $p_-$, we can
proceed as follows.

Equations 1 and 2 state:


$$p_+ = p_o p_1 + (1 - p_o)\,p_3 \tag{1}$$

$$p_- = p_o p_2 + (1 - p_o)\,p_3. \tag{2}$$

Differentiating with respect to time
we obtain:

$$dp_+/dt = p_o(dp_1/dt) + p_1(dp_o/dt) + (1 - p_o)(dp_3/dt) - p_3(dp_o/dt), \tag{8}$$

$$dp_-/dt = p_o(dp_2/dt) + p_2(dp_o/dt) + (1 - p_o)(dp_3/dt) - p_3(dp_o/dt). \tag{9}$$

Substituting values for $dp_1/dt$,
$dp_2/dt$ and $dp_3/dt$ from equations
5, 6, and 7 and rearranging terms we
obtain:

$$dp_+/dt = .5\,p_o^2 f_c(p_1) + .5\,(1 - p_o)^2\,[f_c(p_3) + f_e(p_3)] + (p_1 - p_3)(dp_o/dt), \tag{10}$$

$$dp_-/dt = .5\,p_o^2 f_e(p_2) + .5\,(1 - p_o)^2\,[f_c(p_3) + f_e(p_3)] + (p_2 - p_3)(dp_o/dt). \tag{11}$$


Equations 1, 2, 10, and 11 represent
four simultaneous equations. By
combining these equations we can
express $p_1$, $p_2$, and $p_3$ as
functions of the other variables and
obtain a single expression:


$$dp_o/dt = G(p_+,\, p_-,\, dp_+/dt,\, dp_-/dt,\, p_o), \tag{12}$$


where the function $G$ will depend on
the functions $f_c$ and $f_e$ adopted
for the conditioning and extinction
functions.

Now, if the curves representing the
values of $p_+$ and $p_-$ are determined
experimentally, we can express these
variables as analytic functions of time.
We can also obtain expressions for
$dp_+/dt$ and $dp_-/dt$ as functions of
time. Substituting the functions for
$p_+$, $p_-$, $dp_+/dt$ and $dp_-/dt$ in
equation 12 we obtain:


$$dp_o/dt = G'(t,\, p_o). \tag{13}$$


If this differential equation can be
solved we obtain:


$$p_o = f_o(t). \tag{14}$$


This equation will give us the
desired value of $p_o$ for any point in
time during the experiment.


SUMMARY 


In many discrimination learning
situations some response, such as an
orienting response, will be required of
S before he is exposed to the
discriminative stimuli. We call these
responses "observing responses" ($R_o$),
and indicate their probability of
occurrence as $p_o$. Increases in $p_o$
will result in increased exposure to
the discriminative stimuli, and hence
increased opportunity for S to learn or
manifest discrimination. Decreased
$p_o$ will have the opposite effect.
These results are operationally
equivalent to decreases or increases in
stimulus generalization between the
discriminative stimuli. The following
general hypothesis regarding
changes in $p_o$ can be derived from the
principle of secondary reinforcement.

Hypothesis: Exposure to discriminative
stimuli will have a reinforcing
effect on the observing response to the
extent that S has learned to respond
differently to the two discriminative
stimuli.

From this general hypothesis we
derive the following specific
hypotheses:


1. $p_o$ will increase (or remain high)
under conditions of differential
reinforcement (discrimination training);

2. $p_o$ will decrease (or remain low)
under conditions of non-differential
reinforcement;

3. When a well established
discrimination is reversed, $p_o$ will
decrease temporarily and then recover;

4. If the degree of discrimination
and $p_o$ are both low, the formation of
discrimination will be retarded for
some interval but will finally occur
quite rapidly.


Evidence in support of these specific
hypotheses was obtained in an
experiment in which an $R_o$ was
measured directly.

This formulation may be useful for
interpreting behavior in cases where
changes in generalization between
stimuli occur, and where the ease of
formation of discrimination on the
basis of some particular set of stimuli
changes as a function of training.
Ss learn reversed discriminations more 
and more rapidly if reversals are pre- 
sented repeatedly. The present for- 
mulation offers a relatively simple and 
readily testable interpretation of this 
phenomenon. 

This formulation lends itself to pre- 
cise quantitative statement. A quan- 
titative analysis could be used in two 
ways: (1) to make quantitative pre- 
dictions of behavior based on some 
set of theoretical statements regard- 
ing the component learning processes, 
and (2) to evaluate $p_o$ from
observations of measurable aspects of
effective responses. The steps required
for such an analysis are outlined.